Opening — Why this matters now

AI agents can code, search, analyze data, and even plan holidays. But when the clock starts ticking, they often stumble. The latest benchmark from Shanghai Jiao Tong University — TPS-Bench (Tool Planning and Scheduling Benchmark) — measures whether large language model (LLM) agents can not only choose the right tools, but also use them efficiently in multi-step, real-world scenarios. The results? Let’s just say most of our AI “assistants” are better at thinking than at managing their calendars.

Background — The hidden cost of ‘intelligence’

Benchmarks like SWE-Bench and AssistantBench have already proven that LLMs can solve problems when handed the right tools. Yet, real-world work — from business automation to logistics — rarely offers such luxury. Real tasks are compound: a single request may involve web search, data extraction, and follow-up synthesis. Efficiency isn’t about whether an agent can finish a task, but how many steps, tokens, and seconds it takes to get there.

That’s where TPS-Bench comes in. Built on 15 Model Context Protocol (MCP) servers housing 141 tools, it offers 200 compound tasks across two difficulty tiers — TPS-Easy and TPS-Hard. Each task mimics real multi-step work: for instance, “check the weather, find flights, and plan clothing.” The benchmark doesn’t just test completion; it tracks tool planning, scheduling order, token usage, and execution time.
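The planning/scheduling split is visible even in that small example: clothing advice depends on the weather result, while the flight search can run independently. A minimal sketch of such a compound task as a dependency graph between tool calls (the tool names here are illustrative, not actual TPS-Bench MCP tools):

```python
# Sketch: a compound task as a dependency graph between tool calls.
# Tool names are invented for illustration, not actual TPS-Bench tools.
TASK = {
    "get_weather":    [],               # no prerequisites
    "search_flights": [],               # independent of the weather
    "plan_clothing":  ["get_weather"],  # needs the weather result first
    "final_answer":   ["search_flights", "plan_clothing"],
}

def schedulable_waves(task):
    """Group tool calls into 'waves' whose members could run in parallel."""
    done, waves = set(), []
    while len(done) < len(task):
        wave = sorted(t for t, deps in task.items()
                      if t not in done and all(d in done for d in deps))
        if not wave:
            raise ValueError("cyclic dependency")
        waves.append(wave)
        done.update(wave)
    return waves
```

Here `schedulable_waves(TASK)` yields three waves — the weather check and flight search first (in parallel), then clothing planning, then the final synthesis. Planning picks the nodes; scheduling decides the waves.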

Analysis — When planning meets scheduling

The findings read like a project management parable:

| Model | Task Completion (Hard) | Avg Time (s) | Input Tokens | Scheduling Style |
|---|---|---|---|---|
| GLM-4.5 | 64.7% | 217.8 | 12.6k | Sequential (slow but sure) |
| GPT-4o | 45.1% | 76.8 | 7.2k | Parallel (fast but fragile) |
| Qwen3-32B | 56.7% | 226.2 | 6.7k | Balanced |
| Qwen3-1.7B (RL-trained) | 33.1% | 36.1 | 7.3k | Optimized via RL |

The contrast is telling: GLM-4.5 finishes most tasks but at glacial speed, while GPT-4o rushes through with errors. In essence, AI models today can plan well, but schedule poorly — a limitation that mirrors many human workplaces.

Findings — The anatomy of inefficiency

TPS-Bench’s ablation studies reveal two key tensions:

  1. Tool Selection vs. Efficiency: Exposing all 141 tools to the model inflates context length and token usage dramatically — sometimes past 50k tokens. Smart tool selection doesn’t improve accuracy much, but it saves time and memory, especially for smaller models.
  2. Sequential vs. Parallel Scheduling: Serial execution improves accuracy but costs time; parallel execution cuts latency but increases dependency errors. In short, it’s a classic quality-speed trade-off.
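The second trade-off can be made concrete with a toy makespan calculation: serial execution pays the sum of all tool latencies, while an ideal parallel schedule pays only the longest dependency chain. The durations and tool names below are invented for illustration:

```python
# Toy comparison of serial vs. parallel scheduling cost.
# Durations (seconds) and dependencies are invented for illustration.
DURATION = {"get_weather": 2.0, "search_flights": 5.0, "plan_clothing": 1.0}
DEPS     = {"get_weather": [], "search_flights": [], "plan_clothing": ["get_weather"]}

def serial_makespan(duration):
    # One tool call at a time: total latency is the plain sum.
    return sum(duration.values())

def parallel_makespan(duration, deps, tool=None):
    # Ideal parallelism: latency is the critical path through the graph.
    if tool is None:
        return max(parallel_makespan(duration, deps, t) for t in duration)
    return duration[tool] + max(
        (parallel_makespan(duration, deps, d) for d in deps[tool]), default=0.0
    )
```

Serial finishes in 8.0 s and parallel in 5.0 s here — but the parallel number only holds if the scheduler respects the `plan_clothing` → `get_weather` dependency. Fire everything at once and you get the “fast but fragile” failure mode from the table above.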

The researchers also applied reinforcement learning (RL) — specifically, Group Relative Policy Optimization (GRPO) — to train Qwen3-1.7B for better scheduling. Even with only 100 training samples, the RL-tuned model achieved a 6% higher completion rate and 14% lower latency, suggesting that even small-scale RL fine-tuning can teach models to manage tool workflows more efficiently.
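At its core, GRPO scores each sampled rollout against the mean reward of its own sampling group instead of a learned value baseline. A minimal sketch of that group-relative advantage (the reward values are made up; exact normalization details vary across implementations):

```python
# Sketch of GRPO's group-relative advantage: each rollout in a sampled
# group is scored against the group's own mean/std, with no critic model.
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. four rollouts of the same scheduling task, rewarded on completion
# and latency (values invented for illustration):
rewards = [1.0, 0.0, 0.5, 0.5]
advs = group_relative_advantages(rewards)
```

Rollouts that finish faster or more completely than their group-mates get positive advantage and are reinforced — which is how a 1.7B model can learn scheduling preferences from as few as 100 samples.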

Implications — From academic benchmarks to business automation

For companies deploying LLM-based agents — in customer service, data operations, or digital research — TPS-Bench exposes a simple truth: intelligence is nothing without coordination. The best-performing models in accuracy (like GLM-4.5) are often unusable in latency-sensitive workflows, while faster ones (like GPT-4o) miss critical steps.

The practical takeaway is clear:

  • Task orchestration (the “when” and “how”) is now as important as model accuracy.
  • Reinforcement learning offers a path to balancing precision and efficiency.
  • Benchmarks like TPS-Bench help quantify ROI trade-offs for AI integration — not just “can it do it?” but “is it worth the time and compute cost?”

Conclusion — The rise of time-aware AI

TPS-Bench quietly shifts the conversation from capability to coordination. As LLMs evolve into full-fledged agents, their success will hinge not just on reasoning but on scheduling, prioritization, and latency optimization — the cognitive equivalent of good project management.

It’s an ironic twist: the next leap in AI might not come from bigger brains, but from better calendars.

Cognaptus: Automate the Present, Incubate the Future.