Strategy as a Service: When AI Learns How to Think

Every enterprise AI team eventually meets the same annoying bill: the agent that thinks too much. It calls tools when a direct answer would do. It loops through evaluator prompts for tasks that need one clean instruction. It drags a code interpreter into a problem that is mostly reading comprehension. Then, after all that expensive theatre, it may still be wrong. Very impressive. Very modern. Very invoiceable. ...

November 17, 2025 · 14 min · Zelina

Plans, Tokens, and Turing Dreams: Why LLMs Still Can’t Out-Plan a 15-Year-Old Classical Planner

TL;DR for operators: A new benchmark does not say that LLMs are hopeless at planning. That would be too easy, and also false. It says something more useful: frontier models are now strong enough to solve many formal planning tasks, but their competence still weakens when the task stops giving them semantically meaningful labels.[1] ...

November 13, 2025 · 14 min · Zelina

The Agent Olympics: How Toolathlon Tests the Limits of AI Workflows

Office work is not one task. It is a chain of small obligations pretending to be one task. “Check the homework submissions, download the attached Python files, run them, grade the students in Canvas, and use the latest submission if someone sent more than one.” That sounds like a normal administrative request. It is also a compact torture device for an AI agent. The agent must read email, handle attachments, inspect local files, run code, interpret results, map students to course records, update Canvas, and not confidently grade the wrong person. Easy, apparently, as long as nothing has to actually work. ...

November 4, 2025 · 17 min · Zelina

Deep Thinking, Dynamic Acting: How DeepAgent Redefines General Reasoning

Tools are where agent demos go to die. The pitch is usually elegant. Give the model a goal, attach a few APIs, let it reason, and watch the automation glide across systems like a tiny consultant with no calendar conflicts. Then the real world appears: too many tools, unclear documentation, stale context, partial failures, long interaction histories, and the occasional API response that seems to have been designed by someone settling a personal score. ...

October 31, 2025 · 15 min · Zelina

Recon, Then Wreck the Roadblocks: How Recon‑Act Turns Web Stumbles into Tools

A browser agent does not usually fail like a heroic machine confronting the limits of intelligence. It fails like an intern on a badly designed website. It opens the wrong listing. It misses the tiny sort option. It clicks around because the page has too much visual noise and not enough obvious structure. It sees the button but not the pattern. Then, because the agent has no lasting operational memory of the stumble, the next task sends it back into the same swamp with a fresh pair of shoes. ...

October 2, 2025 · 16 min · Zelina

Org Charts for Robots: What AgentArch Really Tells Us About Enterprise AI

Enterprise AI teams love an architecture diagram. Boxes, arrows, specialist agents, memory stores, tool registries, a tasteful orchestrator sitting at the top like a middle manager with JSON access. It looks reassuring. It looks intentional. It also looks suspiciously like the kind of thing that can fail in six different places while still producing a beautifully formatted answer. ...

September 20, 2025 · 16 min · Zelina

Click Less, Do More: Why API-GUI + RL Could Finally Make Desktop Agents Useful

TL;DR for operators: ComputerRL is not interesting because a 9B model learned to click slightly better. That would be charming, in the way a robot vacuum wedged under a sofa is charming. The paper matters because it attacks the three actual bottlenecks in desktop automation: the wrong interface, the wrong training scale, and the wrong assumption that long RL runs keep exploring by magic.[1] ...

August 20, 2025 · 16 min · Zelina

Mind the Gap: How Tool Graph Retriever Fixes LLMs’ Missing Links

TL;DR for operators: A user asks an AI agent to delete an account. The obvious tool is DeleteAccount. A normal semantic retriever will probably find it. Splendid. The agent still fails if it misses GetUserToken, because the deletion tool needs a token first. This is the failure mode Tool Graph Retriever, or TGR, is built to address.[1] ...

August 8, 2025 · 18 min · Zelina

From GUI Novice to Digital Native: How SEAgent Teaches Itself Software Autonomously

TL;DR for operators: Software automation usually breaks at the interface between “the process is known” and “the application has changed again.” A button moves. A settings panel is renamed. A vendor ships a redesign with the emotional restraint of a toddler near glitter. The usual answer is more labelled demonstrations, more brittle scripts, or more human babysitting. ...

August 7, 2025 · 16 min · Zelina

Mirage Agents: When LLMs Act on Illusions

TL;DR for operators: LLM agents do not merely hallucinate by saying false things. They hallucinate when they act on a version of the world that does not match the task, the history, or the screen in front of them. That is the useful idea in MIRAGE-Bench: it treats agent hallucination as context-unfaithful action. The agent may click a button that is not there, assume a page transition succeeded when it did not, answer a colleague’s question with invented information, submit code despite failed tests, or report success when the environment says otherwise. Very industrious. Very confident. Very much not what you want near production systems. ...

July 29, 2025 · 19 min · Zelina