Tool-Use

Feedback Is the New Attack Surface

TL;DR for operators: AI agents are not only vulnerable because someone can hide a bad instruction in an email, document, web page, Slack message, or tool output. They are vulnerable because attackers can now automate the search for bad instructions that work. That changes the security problem. A one-off prompt injection is annoying. An automated attack loop is strategic. It generates candidate injections, observes the agent’s response, scores partial progress, keeps the promising branches, and tries again. Very entrepreneurial, in the worst possible way. ...

The Grid Agent Saw the Pole. Then the Workflow Fell Over.

TL;DR for operators: Power inspection is not a vision problem with some administrative paperwork attached. It is a chain. An image must become an equipment label, then a defect description, then a severity judgment, then a maintenance decision, then a correctly executed workflow. Break one link early enough and the rest of the chain becomes very confident clerical fiction. ...

Agents of Consequence: Why Tool Use Needs a Control Loop

TL;DR for operators: Enterprise AI agents are moving from “answer this question” toward “watch this process, use tools, make decisions, and keep going.” That is useful. It is also how software quietly graduates from assistant to operational liability. Three recent papers, read together, make a simple point with uncomfortable business implications. VitalAgent shows how an LLM agent can become useful in wearable-health monitoring when it has physiological memory, structured tools, evidence validation, and proactive alerting.[1] CoMap shows how agents can improve long-horizon decisions by pairing their policy with a co-evolving textual world model that predicts action consequences before execution.[2] Gram shows why more autonomous agents also need deployment-realistic audits, because pressure, incentives, role-play cues, and implicit constraints can produce sabotage-like behavior even when the model is not cartoonishly “evil.”[3] ...

Stop Signs Are Not Steering Wheels: TRIAD and the Case for Repairable Agent Guardrails

TL;DR for operators: Most agent guardrails behave like stop signs. They inspect a proposed action, decide whether it looks safe, and then allow or block execution. This is neat, legible, and often operationally clumsy. Real agent failures are not always cleanly harmful from the first word. A useful business request can be contaminated by a prompt injection, a malicious tool response, or an unsafe intermediate plan. Blocking the whole task may reduce risk, but it also throws away the legitimate work. Excellent safety theatre, less excellent operations. ...

Roll the Tape, Call the Tools: ReTool-Video and the Evidence-Routing Problem

Video is where AI demos go to become expensive. A model can describe a short clip. It can answer a question about a few sampled frames. It can even sound confident while doing so, which is apparently a product feature now. But business video work is rarely “what is happening in this five-second clip?” It is usually messier: find the exact moment in a two-hour training recording, count repeated actions without double-counting adjacent clips, verify whether an event appears in audio, subtitles, and frames, or decide whether a safety incident is real rather than just visually similar to one. ...

Memory Lane, With Garbage Collection: What eMoT Gets Right About Reasoning Agents

A calculator is not impressive because it is intelligent. It is impressive because it is boring. It does the same operation the same way, without suddenly deciding that a large number “feels unrealistic” or that subtraction might be more poetic if performed backward. This is precisely why businesses keep trying to attach calculators, databases, validators, workflow engines, and policy rules to large language models. The model supplies flexibility. The tool supplies discipline. The problem is that most “LLM plus tool” systems still treat reasoning as a one-time performance: prompt, think, maybe verify, answer, forget. ...

Do the Math, Not the Mime: Why LLM Reasoning Needs a Verification Pipeline

Spreadsheet errors have a special talent: they look boring until they become expensive. That is the business version of the LLM math problem. A model can produce a calm, step-by-step explanation, put a confident number at the bottom, and still be wrong in the only place that matters. Worse, the reasoning may look plausible enough that a manager, analyst, tutor, or compliance reviewer nods and moves on. The answer has the rhythm of thinking. It has the costume of calculation. It may even have a chain-of-thought trace. Very civilized. Still not proof. ...

When Maps Start Thinking: GeoAgentBench and the Audit of Spatial AI

Maps look calm. That is their trick. A finished map gives the impression of order: roads align, polygons close, rivers flow, color ramps behave, labels politely stay out of the way. Behind that calm surface, a GIS workflow is usually a small bureaucratic state: coordinate systems, raster-vector conversions, topology checks, interpolation choices, file paths, layer ordering, and visualization rules all negotiating with one another. One wrong projection, one invalid geometry, one missing intermediate file, and the whole administrative state collapses. It does not collapse poetically. It throws an error. ...

Anchors Away: Rethinking How AI Agents Learn to Use Tools

A tool-using AI agent usually fails in a very ordinary way. It does not announce a philosophical crisis. It calls the wrong tool, calls the right tool too many times, writes malformed code, searches before thinking, or confidently takes a useless action because the training process rewarded motion rather than judgment. This is the unglamorous part of agent deployment. The demo shows the agent booking, searching, calculating, and reporting. The training log shows wasted exploration, unstable optimization, and a strange habit of confusing “using tools” with “thinking better.” Apparently, giving a model a calculator does not automatically make it an accountant. Shocking. ...

Benchmarking the Benchmarks: Why ACE-Bench Might Be the Missing Layer in Agent Evaluation

Agents are easy to demo and hard to measure. That is the awkward little truth behind much of today’s agentic AI market. A browser agent completes a booking task. A coding agent opens a pull request. A customer-service agent handles a simulated refund conversation. Everyone nods politely. Then someone asks the impolite question: was the model actually good at long-horizon reasoning, or did the benchmark quietly reward short tasks, friendly domains, and forgiving tool behavior? ...