Tool-Use

Agents on the Clock: How TPS-Bench Exposes the Time Management Problem in AI

A competent assistant can make a list. A useful assistant knows what must happen first. That distinction sounds small until an AI agent is asked to do something ordinary and annoyingly realistic: check a calendar, search the web, compare options, use a map, assemble a recommendation, and perhaps create a document at the end. None of those steps is exotic. The difficulty is that some of them can run in parallel, some must wait for earlier results, and some become nonsense if executed too early. This is less “genius at work” than “junior operations manager with access to too many browser tabs.” Naturally, it is where things get interesting. ...
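
To make the ordering problem concrete, here is a minimal sketch that groups steps into "waves" that could run in parallel, using Python's standard `graphlib`. The step names and the dependency graph are illustrative assumptions based on the example above; they are not taken from TPS-Bench, and this is not the benchmark's scheduler.

```python
from graphlib import TopologicalSorter

def plan_waves(deps):
    """Group steps into waves; every step in a wave has all of its prerequisites done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps that can run in parallel right now
        waves.append(ready)
        sorter.done(*ready)  # mark them finished so their dependents unlock
    return waves

# Hypothetical dependency graph: each step maps to the steps it must wait for.
deps = {
    "check_calendar": set(),
    "search_web": set(),
    "use_map": {"search_web"},
    "compare_options": {"search_web", "use_map"},
    "recommend": {"check_calendar", "compare_options"},
    "create_document": {"recommend"},
}

waves = plan_waves(deps)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Under these assumed dependencies, the calendar check and the web search share the first wave, and the document is created last. An agent that writes the document in wave one has produced exactly the nonsense described above.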

When the Sandbox Thinks Back: Training AI Agents in Simulated Realities

Workflow software has a deeply unglamorous problem: reality keeps changing. A customer support agent may know the refund policy, but then the customer changes their address, the order record has a missing field, the tool returns a cryptic error, and the next API call requires a schema nobody mentioned in the demo. A spreadsheet agent may know how to summarise a table, but the file path is wrong, the calendar has a conflicting event, and the “obvious” action fails because the world, in its charmingly vindictive way, is not a benchmark prompt. ...

The Agent Olympics: How Toolathlon Tests the Limits of AI Workflows

Office work is not one task. It is a chain of small obligations pretending to be one task. “Check the homework submissions, download the attached Python files, run them, grade the students in Canvas, and use the latest submission if someone sent more than one.” That sounds like a normal administrative request. It is also a compact torture device for an AI agent. The agent must read email, handle attachments, inspect local files, run code, interpret results, map students to course records, update Canvas, and not confidently grade the wrong person. Easy, apparently, as long as nothing has to actually work. ...

The Esperanto of AI Agents: How the Agent Data Protocol Unifies a Fragmented Ecosystem

Every engineering team has met this problem: the useful data exists, but it lives in thirteen different shapes, three different tool conventions, two incompatible logs, and one heroic spreadsheet that nobody dares to open. AI agents have the same disease, only with more acronyms. The paper behind the Agent Data Protocol, or ADP, argues that large-scale supervised fine-tuning of AI agents has been held back less by a lack of data than by a lack of shared representation [1]. Agent datasets already exist for coding, software engineering, web browsing, API use, operating-system interaction, and general tool use. The difficulty is that each one tends to encode actions, observations, tool calls, web states, messages, and execution feedback in its own local dialect. Naturally, every dataset is special. How convenient for nobody. ...

Deep Thinking, Dynamic Acting: How DeepAgent Redefines General Reasoning

Tools are where agent demos go to die. The pitch is usually elegant. Give the model a goal, attach a few APIs, let it reason, and watch the automation glide across systems like a tiny consultant with no calendar conflicts. Then the real world appears: too many tools, unclear documentation, stale context, partial failures, long interaction histories, and the occasional API response that seems to have been designed by someone settling a personal score. ...

Backtrack to Breakthrough: Why Great AI Agents Revisit

Search is easy. Knowing when to go back is harder. That is the useful irritation inside GSM-Agent, a new benchmark for studying agentic reasoning under controlled conditions [1]. The paper takes grade-school maths problems from GSM8K, removes the premises from the prompt, hides those premises in a searchable document database, and asks an LLM agent to recover the facts before solving the problem. The arithmetic is not supposed to be impressive. That is the point. If a model fails here, we cannot calmly blame differential geometry, PhD-level law, or some mysteriously adversarial enterprise workflow. The agent simply did not find and use the facts. ...
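
The setup described above can be imitated with a toy sketch: split a word problem into premises and a question, hide the premises among distractor documents, and make the solver search for them. Everything here is an assumption for illustration: the problem text is invented (it is not from GSM8K), `split_problem` and `search` are hypothetical helpers, and keyword overlap stands in for whatever retrieval the benchmark actually uses.

```python
import re

def tokens(text):
    """Lowercased word and number tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def split_problem(problem):
    """Separate declarative premises from the final question."""
    sentences = [s.strip() for s in re.split(r"(?<=[.?!])\s+", problem.strip()) if s.strip()]
    premises = [s for s in sentences if not s.endswith("?")]
    question = " ".join(s for s in sentences if s.endswith("?"))
    return premises, question

def search(docs, query, k=2):
    """Return up to k documents ranked by keyword overlap with the query."""
    q = tokens(query)
    scored = [(len(q & tokens(doc)), doc) for doc in docs]
    ranked = sorted(scored, key=lambda pair: -pair[0])  # stable: ties keep document order
    return [doc for score, doc in ranked if score > 0][:k]

# Invented example problem, not taken from GSM8K.
problem = "Ana has 3 apples. Ben gives Ana 4 more apples. How many apples does Ana have?"
premises, question = split_problem(problem)

# The agent sees only the question; the premises live in the document store.
docs = premises + ["Ben owns 2 bicycles."]
hits = search(docs, "Ana apples")
print(question)
print(hits)
```

The agent is handed only `question`. If its queries never surface both premises, no amount of arithmetic skill will rescue the answer, which is the failure mode the benchmark is built to isolate.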

Lost in the Long Game: What UltraHorizon Reveals About Agent Failure at Scale

Budget is the most comforting word in enterprise AI. Give the agent a bigger context window. Give it more tool calls. Give it more time. Give it a notebook, a browser, a Python interpreter, a reminder to “think step by step,” and perhaps a small motivational speech about being thorough. Surely the system will become more reliable. ...

Paths > Outcomes: Measuring Agent Quality Beyond the Final State

A calendar assistant creates the right meeting. A compliance agent files the right flag. A robotic controller moves the right object. Everyone applauds, because the final state is correct. Then someone checks the logs. The calendar assistant created, deleted, recreated, and re-notified the same meeting. The compliance agent skipped the required policy check and jumped straight to enforcement. The robot got the object into place only after executing a step that would have been unsafe if the power had cut out halfway through. The destination was fine. The route was a mess. In enterprise automation, this is not a philosophical distinction. It is the difference between “the demo worked” and “legal now wants a meeting.” ...
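
A minimal sketch of what checking the route, not just the destination, might look like: scan an action log for missing prerequisites and for redundant repeats. The action names, the `must_precede` rules, and the repeat threshold are assumptions made up for this example; they do not come from the article's underlying paper.

```python
from collections import Counter

def path_violations(trajectory, must_precede=(), max_repeats=1):
    """Flag ordering and redundancy problems in a list of action names."""
    violations = []

    # Record the first position at which each action appears.
    first_seen = {}
    for index, action in enumerate(trajectory):
        first_seen.setdefault(action, index)

    # Ordering rules: `before` has to occur ahead of the first `after`.
    for before, after in must_precede:
        if after in first_seen:
            if before not in first_seen or first_seen[before] > first_seen[after]:
                violations.append(f"{after!r} ran without a prior {before!r}")

    # Redundancy: the same action executed more often than allowed.
    for action, count in Counter(trajectory).items():
        if count > max_repeats:
            violations.append(f"{action!r} repeated {count} times")

    return violations

# Hypothetical logs mirroring the calendar and compliance examples above.
calendar_log = ["create_meeting", "delete_meeting", "create_meeting", "notify", "notify"]
compliance_log = ["file_flag"]
rules = [("policy_check", "file_flag")]

print(path_violations(calendar_log))
print(path_violations(compliance_log, must_precede=rules))
```

Both logs end in the correct final state, and both are flagged. A final-state check alone would have passed them.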

Tool Wars, Protocol Peace: What MCP‑AgentBench Really Measures

A procurement team does not buy an AI agent because it can recite the word “interoperability” with theatrical confidence. It buys the agent because the thing can use tools, collect data, combine results, and stop before it bankrupts the token budget. That is the useful way to read MCP-AgentBench, a new benchmark for evaluating language agents inside the Model Context Protocol ecosystem [1]. The paper is not just another leaderboard with a fresh coat of protocol paint. Its more interesting result is harsher: MCP gives agents a common integration layer, but it does not make them competent tool users. Compatibility is plumbing. Competence is orchestration. ...

Tool Time, Any Time: Inside RLFactory’s Plug‑and‑Play RL for Multi‑Turn Tool Use

Tool calls are where agent demos stop being cute. A chatbot can talk through a task all day. A working agent has to search, query, execute, verify, retry, and sometimes discover that the tool it politely called has returned a malformed answer after making everyone wait. That is the difference between “reasoning about work” and doing work. The former gives you fluent paragraphs. The latter gives you latency, interface contracts, timeout handling, reward ambiguity, and a suspicious number of JSON parsing errors. Glamorous, naturally. ...