Tool-Use

Checklist Capital: Reinforcing Agents Without Verifiable Rewards

Checklist. It is not the most glamorous word in artificial intelligence. It does not sound like a new reasoning architecture, a sovereign model, or a mildly terrifying demo video. It sounds like something an operations manager would use before approving a vendor payment. That is exactly why it matters. Most enterprise agents fail to fit the clean reward structure that reinforcement learning likes. A coding benchmark can verify whether tests pass. A math problem can verify the final answer. A database query can sometimes verify whether a returned value matches the expected record. But business agents live in a less cooperative universe. They ask clarification questions, call internal tools, respect constraints, recover from missing information, and produce replies that are useful without being exactly predictable. ...

World-Building for Agents: When Synthetic Environments Become Real Advantage

A customer-support agent can sound impressive in a demo and still collapse the first time it has to change an address, cancel a duplicate order, rebook a flight, and explain what happened afterward. That collapse usually does not come from weak prose. The model can write the apology beautifully. The problem is that the world behind the apology has state. Orders exist or do not exist. Inventory changes. Refunds create records. A bad tool call can mutate the wrong row. A follow-up answer must reflect what the agent actually did, not what it vaguely intended to do. ...

Agents Need Worlds, Not Prompts: Inside ScaleEnv’s Synthetic Environment Revolution

Workflow automation has a bad habit of looking impressive right up to the moment it touches reality. A demo agent can summarize a refund policy, draft a polite message, and call a refund_order() tool with great confidence. Then the real workflow asks a boring question: does this order exist, is it within the refund window, has it already been refunded, does the customer’s loyalty tier matter, and should the database state change after approval? ...

DRIFT-BENCH: When Agents Stop Asking and Start Breaking

A user says, “Update the record with a sensible value.” That sentence is small. The damage may not be. For a normal chatbot, the worst outcome might be a vague answer wearing a confident expression. Annoying, yes, but usually recoverable. For an agent connected to a database, file system, workflow platform, or API service, the same ambiguity becomes operational. The model may update the wrong row, call the wrong endpoint, overwrite a file, or politely explain its mistake after making it. Charming, in the same way a self-driving forklift is charming. ...

When Empathy Needs a Map: Benchmarking Tool-Augmented Emotional Support

Empathy is easy to fake for one sentence. A chatbot can say “that sounds exhausting” without knowing anything about you, your situation, your city, your time zone, or whether the advice it is about to give is physically possible. That is the awkward part of emotional support AI: the tone can be soft while the facts are made of air. A very caring assistant can still recommend a midnight walk at 3 p.m., suggest a closed café, or confidently invent local details because it wants to be helpful. The kindness is real enough in style. The grounding is not. ...

CAR-bench: When Agents Don’t Know What They Don’t Know

A car assistant sounds simple until it touches the car. “Turn on the fan.” “Open the sunroof.” “Change my destination to Barcelona.” “Send an email before I arrive.” None of these requests looks philosophically difficult. They are not graduate-level math problems. They do not require poetic reasoning, legal interpretation, or a 128k-token context window stuffed with PDFs. They require the assistant to do something much less glamorous: check the state of the world, follow a few policies, use the right tools, and avoid pretending when something is missing. ...

When Rewards Learn to Think: Teaching Agents How They’re Wrong

An agent fails a task. It searched the web twice, opened the wrong page, trusted a noisy snippet, wrote a plausible final answer, and lost the point. Traditional reinforcement learning sees one thing: wrong. That is brutally clean, and also rather unhelpful. The agent may have performed three useful steps before collapsing at the fourth. Or it may have wandered confidently through nonsense from the beginning. Sparse final-answer rewards flatten these cases into the same training signal. The scoreboard says “0.” Very educational, in the same way a fire alarm teaches architecture. ...

When LLMs Get a Laptop: Why Sandboxes Might Be the Real AGI Benchmark

Laptop. That is the deceptively simple object hiding inside this paper. Not a magic planner. Not a thousand-tool agent marketplace. Not a baroque workflow with seventeen orchestration layers and a dashboard that looks like a cockpit designed by consultants. A laptop. Or, more precisely, a minimal virtual computer: a sandbox with terminal access, file editing, code execution, persistent files, and the ability to install or fetch resources. In “Computer Environments Elicit General Agentic Intelligence in LLMs,” Cheng et al. ask a question that looks almost too obvious to be interesting until one remembers how much of the AI industry is still trying to squeeze “agency” out of longer prompts. ...

MatchTIR: Stop Paying Every Token the Same Salary

Payroll is a useful metaphor for agent training because it makes the absurdity obvious. Imagine a project team where one employee finds the right database, another enters the correct query, a third repeatedly calls the wrong API, and a fourth finally writes the report. If the report is accepted, everyone receives the same bonus. If it fails, everyone receives the same blame. Very democratic. Also very stupid. ...

One Agent Is a Bottleneck: When Genomics QA Finally Went Multi-Agent

Databases are where elegant AI demos go to develop a limp. A model can sound fluent about biology, medicine, finance, or law. Then someone asks a question that requires the latest record from a specialized database, a second lookup from another source, a formatted API call, a large HTML response, and a final answer that does not forget the original question halfway through. Suddenly the “AI assistant” becomes a very expensive intern copying URLs into the wrong field. ...