Tool-Use

From Seeing to Doing: Why Agentic AI Still Trips Over Reality

Tools do not make an agent; they make the failure more interesting. Camera. Browser. Crop tool. Search engine. Python sandbox. That sounds like the beginning of an intelligent workflow. Give a multimodal model these tools, and it should move from merely seeing the world to actually doing something with it: zoom into the blurry sign, search the extracted clue, cross-check the result, and produce the answer. ...

Pre-Decision Intelligence: When AI Decides Before It Thinks

Audit logs are comforting things. They tell managers that a system took an action, they tell engineers which step fired, and they tell compliance teams that someone, somewhere, has a line of text to point at when the incident review begins. Now imagine an AI agent inside a business workflow. It has a customer request, a list of available tools, and a visible reasoning trace. The trace says it carefully considered whether to call an API, ask for missing information, or answer directly. It sounds deliberate. It sounds inspectable. It sounds like governance. ...

Friction Over Fiction: Why AI Agents Need to Feel Resistance

Tools are not free. That sentence sounds too obvious to deserve an article, which is usually a warning that the industry has built several architectures pretending it is false. A tool-using AI agent can call a search API, query a database, inspect a document, ask another model, trigger a diagnostic pipeline, or run a workflow step. In a clean demo, each call feels like another harmless unit of intelligence. The agent thinks, acts, observes, thinks again, and the audience applauds because the trace looks busy. Busy is often mistaken for capable. Enterprise software has enjoyed this little confusion for decades. ...

The Cost of Thinking Twice: Why Agentic AI Needs a CFO

Budget. That is the word agentic AI usually discovers after the demo is over. During the demo, the agent searches again. It verifies again. It calls another tool, adds another reasoning step, and produces an answer that feels satisfyingly deliberate. In production, the same behavior becomes less charming. Tokens accumulate, latency stretches, logs become harder to inspect, and nobody is entirely sure whether the last two tool calls were useful or just the machine equivalent of pacing around the room with a clipboard. ...

Act While Thinking: When AI Agents Learn to Multitask (Finally)

Waiting is the least glamorous part of an AI agent. A user asks for a report, a code fix, a dataset analysis, or a literature scan. The agent thinks, calls a tool, waits, reads the result, thinks again, calls another tool, waits again, and repeats this little ritual until the final answer appears. From the outside, this looks like “reasoning.” From the system side, much of it is simply queueing around tools. ...

Topology Trouble: Why Even Frontier LLMs Still Get Lost in a Grid

Grid. It looks like the friendliest possible structure. Rows, columns, symbols, rules. No blurry photos, no social nuance, no awkward customer email written at 1:13 a.m. Just a small board and a set of constraints. Naturally, this is where modern reasoning models still manage to embarrass themselves. The paper introducing TopoBench studies a deceptively simple question: can frontier large language models solve topology-heavy grid puzzles where the answer depends on connectivity, loop closure, symmetry, visibility, and state consistency? The answer is not “never.” That would be too easy. The answer is more annoying: models often understand enough to start correctly, reason long enough to sound competent, and then lose the structure that makes the solution valid. ...

When AI Agents Read the Manual: Why τ-Knowledge Exposes the Limits of LLM Reasoning

A customer asks a banking agent to handle a routine request. Freeze a card. Replace a lost wallet. Open a better savings account. Close an old credit card. Apply a referral bonus. Nothing here sounds like artificial general intelligence. It sounds like Tuesday morning in a customer support queue. Then the agent has to read the internal policy, discover which tool exists, verify the customer’s account state, notice that one action blocks another, decide whether the user’s claim needs verification, and make the right database update. ...

Think, Then Do: Why ReAct Turned LLMs into Real Agents

A chatbot answers. An agent checks. That distinction sounds small until a workflow fails at 2:17 p.m. because the model confidently invented a policy clause, skipped the database lookup, and then explained itself with the serene authority of a consultant who has already left the building. The 2022 paper ReAct: Synergizing Reasoning and Acting in Language Models matters because it made that failure mode harder to ignore. It did not simply ask language models to “think step by step.” Chain-of-thought prompting already did that. It did not simply attach a search box to a model. Retrieval-augmented systems were already moving in that direction. The paper’s real contribution was more architectural: it showed that a language model could alternate between reasoning, acting, observing, and revising its next move. ...
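
The reason, act, observe, revise cycle described above can be sketched as a small loop. This is an illustration only, not the paper's implementation: `run_agent`, `scripted_policy`, the `lookup` tool, and the refund-window data are all hypothetical names invented here, and a hard-coded policy stands in for the language model.

```python
from typing import Callable, Dict, List, Optional, Tuple

# One recorded step: (thought, action, argument, observation).
Step = Tuple[str, str, str, str]
# A policy maps (question, trace so far) to (thought, action, argument).
Policy = Callable[[str, List[Step]], Tuple[str, str, str]]


def run_agent(
    policy: Policy,
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 5,
) -> Tuple[Optional[str], List[Step]]:
    """Alternate between reasoning, acting, and observing until 'finish'."""
    trace: List[Step] = []
    for _ in range(max_steps):
        thought, action, arg = policy(question, trace)  # reason about the next move
        if action == "finish":
            return arg, trace                           # arg carries the final answer
        observation = tools[action](arg)                # act, then observe the result
        trace.append((thought, action, arg, observation))
    return None, trace                                  # step budget exhausted


def scripted_policy(question: str, trace: List[Step]) -> Tuple[str, str, str]:
    """Hypothetical stand-in for a model: look the fact up once, then answer."""
    if not trace:
        return ("I should check the policy table rather than guess.",
                "lookup", "refund_window")
    last_observation = trace[-1][3]
    return ("The lookup answered the question.", "finish", last_observation)


# Hypothetical tool backed by made-up data.
POLICY_TABLE = {"refund_window": "30 days"}
tools = {"lookup": lambda key: POLICY_TABLE.get(key, "not found")}

answer, trace = run_agent(scripted_policy, tools, "What is the refund window?")
print(answer)
for step in trace:
    print(step)
```

The point of the sketch is the control flow: the answer comes from an observation the agent actually collected, not from whatever the policy would have said without looking.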

Trust Issues? Fixing Test-Time RL with Verified Votes

A model can be wrong in a very human way: not by hesitating, but by becoming popular with itself. That is the uncomfortable premise behind Tool Verification for Test-Time Reinforcement Learning, a new paper proposing a method it calls T3RL. The paper studies a specific weakness in label-free test-time reinforcement learning: when a reasoning model generates many candidate solutions, uses majority voting as a pseudo-label, and then trains itself toward that answer, the “most common” answer may simply be the most common mistake. ...
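
A toy example makes the weakness concrete. The sketch below is not the paper's algorithm; it only contrasts a plain majority vote with a vote restricted to candidates that pass an external check. The function names, the candidate answers, and the arithmetic checker are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def verified_vote(
    answers: Sequence[str], verify: Callable[[str], bool]
) -> Optional[str]:
    """Vote only among answers that pass the check; None if none pass."""
    kept = [a for a in answers if verify(a)]
    return majority_vote(kept) if kept else None


# Hypothetical samples for "What is 17 * 23?". Three of five repeat the same slip.
candidates = ["381", "381", "381", "391", "391"]


def check(answer: str) -> bool:
    """Hypothetical tool check: recompute the product instead of trusting the vote."""
    return int(answer) == 17 * 23


print(majority_vote(candidates))          # 381, the most common mistake
print(verified_vote(candidates, check))   # 391, the answer the check confirms
```

Training toward the first label would reinforce the error, which is the failure the excerpt describes. Filtering by a check is only as good as the check itself, so the sketch returns `None` rather than guessing when nothing passes.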

Gamma Rays and Toolboxes: Why Superintelligence May Be a Systems Engineering Problem

Toolboxes are not glamorous. Nobody gives a keynote about the screwdriver. Nobody writes breathless think-pieces about the socket wrench. But when a complicated system fails, the difference between “genius” and “expensive confusion” is often whether the operator had the right tool, used it at the right moment, and trusted it to do the part humans should not pretend to do mentally. ...