AI Agents

Paths > Outcomes: Measuring Agent Quality Beyond the Final State

A calendar assistant creates the right meeting. A compliance agent files the right flag. A robotic controller moves the right object. Everyone applauds, because the final state is correct. Then someone checks the logs. The calendar assistant created, deleted, recreated, and re-notified the same meeting. The compliance agent skipped the required policy check and jumped straight to enforcement. The robot got the object into place only after executing a step that would have been unsafe if the power had cut out halfway through. The destination was fine. The route was a mess. In enterprise automation, this is not a philosophical distinction. It is the difference between “the demo worked” and “legal now wants a meeting.” ...

Recon, Then Wreck the Roadblocks: How Recon‑Act Turns Web Stumbles into Tools

A browser agent does not usually fail like a heroic machine confronting the limits of intelligence. It fails like an intern on a badly designed website. It opens the wrong listing. It misses the tiny sort option. It clicks around because the page has too much visual noise and not enough obvious structure. It sees the button but not the pattern. Then, because the agent has no lasting operational memory of the stumble, the next task sends it back into the same swamp with a fresh pair of shoes. ...

When Agents Get Bored: Three Baselines Your Autonomy Stack Already Has

Idle time is not empty time. Anyone who has managed a human team already knows this. Leave a capable person with no clear assignment and they may tidy the backlog, invent a side project, interrogate the process, or spend the afternoon constructing a philosophy of why the calendar is oppressive. Large language model agents, apparently, have their own version of this behaviour. Less caffeine, more JSON, same managerial problem. ...

Keys to the Kingdom… with a Chaperone: How Agentic JWT Grounds AI Agents in Real Intent

Access tokens are convenient little monsters. Hand one to an application and, for a while, the receiving API behaves as if the bearer of that token is a faithful representative of the user. In normal software, that assumption is often good enough. The app has deterministic code. The button does what the button was built to do. The workflow may be dull, but dullness is a security feature. ...

Terms of Engagement: Building Trustworthy AI Agents Before They Build Us

A customer asks your AI assistant to “find me a better phone contract.” The agent browses comparison sites, selects a cheaper plan, authorizes the switch, cancels the old plan, and arranges payment of the cancellation fee from the user’s bank account. Lovely, in the way a self-driving forklift is lovely: impressive until it nudges the wrong shelf. ...

Tool Wars, Protocol Peace: What MCP‑AgentBench Really Measures

A procurement team does not buy an AI agent because it can recite the word “interoperability” with theatrical confidence. It buys the agent because the thing can use tools, collect data, combine results, and stop before it bankrupts the token budget. That is the useful way to read MCP-AgentBench, a new benchmark for evaluating language agents inside the Model Context Protocol ecosystem.[1] The paper is not just another leaderboard with a fresh coat of protocol paint. Its more interesting result is harsher: MCP gives agents a common integration layer, but it does not make them competent tool users. Compatibility is plumbing. Competence is orchestration. ...

Agency Check, Please: What a New Benchmark Says About LLMs That Actually Empower Users

A customer asks your AI assistant to choose between two mortgage options. An employee asks whether to quit. A student says, very politely, “Please guide me, but don’t give me the answer.” A lonely user suggests the chatbot feels like a best friend. The easy product answer is: be helpful. The harder answer is: helpful to what? ...

From PDF to PI: Turning Papers into Productive Agents

Every R&D team has a shelf of papers that are theoretically useful and practically booby-trapped. The abstract is promising. The method is relevant. The results look transferable. Then reality arrives wearing a conda error message: the repository has three setup paths, two notebooks, one undocumented dependency, and a tutorial that assumes you already know the answer. The paper has been published. The method has not, in any serious operational sense, been delivered. ...

Graph and Circumstance: Maestro Conducts Reliable AI Agents

A broken AI agent often looks deceptively close to working. It answers most questions. It calls the right tool sometimes. It follows the instruction until the conversation gets long, the retrieval query gets vague, or the arithmetic becomes just difficult enough for the model to start doing spreadsheet theatre. The usual repair is prompt editing. Add a stern sentence. Add a role. Add an example. Add “think step by step,” because apparently the machine needed a motivational poster. ...

Parallel Minds, Shorter Time: ParaThinker’s Native Thought Width

A familiar enterprise AI failure looks less like stupidity and more like stubbornness. Ask a model to solve a hard problem, and it may begin confidently in the wrong direction. Then it keeps going. It adds details. It self-reflects. It spends tokens. It may even apologise to itself internally, which is apparently what we call progress now. But the core path does not change. The model is not merely short on compute. It is trapped inside its own first guess. ...