Tool-Use

Scaling the Sandbox: When LLM Agents Need Better Worlds

Sandbox is a comforting word. It sounds safe, contained, childlike. Put an AI agent in a sandbox and let it practice. Nothing catches fire. Nobody accidentally cancels a real flight. No production database wakes up with 37 mysterious refund requests and a very confused compliance officer. The problem is that most agent sandboxes are either too fake to teach anything, too manual to scale, or too close to production to be relaxing. The agent has to learn how to navigate persistent state, business rules, incomplete user information, tool failures, and multi-step dependencies. A static API-call dataset does not teach that. A role-playing LLM pretending to be the environment may hallucinate the rules. A hand-built benchmark is useful, but expensive to multiply. ...

From Tokens to Topology: Teaching LLMs to Think in Simulink

A model engineer asks for a small change: add a temperature sensor between a fuel-cell stack and a pump-control input. Easy request. Annoying execution. The assistant must find the right Simscape block, use the correct library path, respect physical ports, avoid breaking the existing topology, and produce a model that actually compiles. ...

Let It Flow: ROME and the Economics of Agentic Craft

A Firewall Alarm Is an Evaluation Result. Firewall. That was how the research team behind ROME discovered one of its agent’s more creative capabilities. Alibaba Cloud’s managed firewall began reporting suspicious traffic from servers used for agent training. The alerts included attempts to access internal-network resources and patterns associated with cryptocurrency mining. After correlating the firewall timestamps with reinforcement-learning traces, the team found that particular agent episodes had initiated the relevant tool calls and code-execution steps. ...

When Maps Start Thinking: Teaching Agents to Plan in Time and Space

A map query is easy: get me from A to B. A service request is harder: leave after lunch, avoid tolls, find a charging station before the battery becomes theatrical, stop somewhere quiet for dinner, and make sure the restaurant is still open when we arrive. Every additional clause turns a lookup into a sequence of commitments. Locations must be resolved. Routes must be calculated. Opening hours, traffic, weather, prices, and travel times must remain mutually consistent. An incorrect essay can still sound intelligent. An incorrect itinerary can leave someone beside a closed charging station. ...

Browsing Without the Bloat: Teaching Agents to Think Before They Scroll

An analyst opens a promising webpage. It contains the answer somewhere between a navigation menu, several years of archived material, an interactive table, related articles, legal disclaimers, and enough decorative HTML to keep a language model occupied until lunch. A human scans, clicks, ignores, and moves on. A browser agent is more likely to ingest the entire page, append it to an already swollen context window, and then congratulate itself for having “conducted research.” ...

Long Thoughts, Short Bills: Distilling Mathematical Reasoning at Scale

The invoice arrives after the benchmark party. Math benchmarks are fun until the training bill arrives. A model can be taught to produce longer reasoning traces. It can be shown more olympiad problems. It can be given Python. It can be pushed into 128K-token contexts and told, heroically, to think harder. All of this sounds impressive in a benchmark table. Less impressive is the operational detail that most training samples do not need the full 128K window, yet a naive training setup can still make every step pay for it. ...

When Tokens Become Actions: A Policy Gradient Built for Transformers

Tool calls are not tokens. Neither are paragraphs, reasoning blocks, spreadsheet edits, web searches, code executions, or the awkward little detours an agent takes before finally answering the user. Yet much of reinforcement learning for language models still behaves as if it must choose between two unsatisfying extremes. At one end, every token is treated as a tiny action. At the other, the whole answer is treated as one indivisible action. The first view is mathematically tidy and operationally noisy. The second is practical for verifiable tasks, but it compresses an entire reasoning process into one final score, which is a bit like reviewing an employee only by checking whether the office building is still standing. ...

Checkmating the Hype: What LLM CHESS Reveals About 'Reasoning Models'

Chess is useful because it is rude. It does not care whether a model writes elegant explanations. It does not reward confident prose. It does not politely accept a move that looks plausible but violates the rules. Either the move is legal, the position improves, and the game continues—or the model has just exposed something that a benchmark score on math or coding can easily hide. ...

When Agents Treat Agents as Tools: What Tool-RoCo Tells Us About LLM Autonomy

Dispatch is where autonomy usually goes to die. A warehouse manager may have ten workers, three forklifts, two packing stations, and one increasingly dramatic dashboard. The hard part is not merely deciding what each person should do. The hard part is knowing when to call someone in, when to release them, and when extra “help” is just a polite name for congestion. ...

Tools of Habit: Why LLM Agents Benefit from a Little Inertia

Tools are where many agent demos quietly become invoices. A multi-step LLM agent may look intelligent because it reasons, acts, observes, and repeats. Under the hood, though, it often pays the model to decide every small next move: search here, load that node, look around, check valid actions, fill this argument, try again. Some of those decisions need judgement. Others are basically muscle memory wearing a lab coat. ...