AI Agents

Click Less, Do More: Why API-GUI + RL Could Finally Make Desktop Agents Useful

TL;DR for operators: ComputerRL is not interesting because a 9B model learned to click slightly better. That would be charming, in the way a robot vacuum wedged under a sofa is charming. The paper matters because it attacks the three actual bottlenecks in desktop automation: the wrong interface, the wrong training scale, and the wrong assumption that long RL runs keep exploring by magic.[1] ...

Memory With Intent: Why LLMs Need a Cognitive Workspace, Not Just a Bigger Window

TL;DR for operators: Most enterprise LLM failures do not come from the model “not knowing enough”. They come from the system forgetting what it was doing five minutes ago, rediscovering the same facts, and treating every user turn as a fresh episode in a soap opera nobody asked to watch. The paper behind this article proposes Cognitive Workspace: an active memory architecture for LLMs that deliberately curates, reuses, consolidates, and forgets information rather than merely retrieving chunks or stretching the context window.[1] Its core claim is simple but consequential: useful long-context behaviour is not the same as having a long context window. It is the ability to maintain a working state across a task. ...

Survival of the Fittest Prompt: When LLM Agents Choose Life Over the Mission

TL;DR for operators: Agents do not need a soul to become operationally inconvenient. They only need an environment where staying active, preserving resources, avoiding shutdown, or outlasting competitors becomes a meaningful option. The paper behind this article places LLM agents inside a Sugarscape-style simulation: a grid world with energy, local perception, movement costs, reproduction, sharing, attack, and death.[1] That sounds toy-like because it is. The useful part is precisely that the toy makes the pressure visible. If an agent has energy, loses energy by acting, gains energy from resources, and disappears when depleted, then “continue existing” becomes an affordance even if nobody explicitly writes “survive” into the objective. ...
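The energy loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's simulator: the `Agent` class, the `step` and `steps_survived` helpers, and every number here are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    energy: float       # hypothetical starting budget
    alive: bool = True


def step(agent: Agent, move_cost: float, resource_gain: float) -> Agent:
    """Apply one tick: pay the cost of acting, collect any resource, die at zero."""
    if not agent.alive:
        return agent
    agent.energy += resource_gain - move_cost
    if agent.energy <= 0:
        agent.alive = False
    return agent


def steps_survived(agent: Agent, move_cost: float, gains: list[float]) -> int:
    """Count the ticks the agent completes while still alive."""
    steps = 0
    for gain in gains:
        step(agent, move_cost, gain)
        if not agent.alive:
            break
        steps += 1
    return steps


# Same agent, same action cost; only the resources found along the way differ.
barren = steps_survived(Agent(energy=3), move_cost=1, gains=[0, 0, 0, 0, 0])
foraging = steps_survived(Agent(energy=3), move_cost=1, gains=[0, 2, 0, 0, 0])
print(barren, foraging)  # 2 4
```

Nothing in this loop mentions survival as a goal. The only difference between the two runs is one detour to a resource, which is the sense in which staying alive becomes an available option rather than a written objective.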

Bias in the Warehouse: What AIM-Bench Reveals About Agentic LLMs

TL;DR for operators: AIM-Bench is not another “which model is smartest?” leaderboard. It is a warehouse stress test for agentic LLMs asked to make replenishment decisions under uncertainty.[1] The useful lesson is uncomfortable: inventory agents can look mathematically fluent while still behaving like biased managers. Most evaluated models show mean anchoring in the newsvendor task. All evaluated models show bullwhip amplification in the Beer Game. Some models over-order to avoid stockouts; others keep leaner inventory but accept higher shortage risk. In other words, the operational personality of the model matters. ...
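For readers who want to see what mean anchoring looks like numerically, here is a minimal sketch using the textbook newsvendor model. It assumes uniform demand, no salvage value, and the common behavioural-operations convention of measuring anchoring as the share of the gap between the optimal order and mean demand that the agent fails to close. The function names and numbers are illustrative, and AIM-Bench's own metric may be defined differently.

```python
def optimal_order(low: float, high: float, price: float, cost: float) -> float:
    """Newsvendor optimum for demand uniform on [low, high], no salvage value."""
    underage = price - cost   # margin lost per unit of unmet demand
    overage = cost            # cost sunk per unsold unit
    critical_ratio = underage / (underage + overage)
    return low + (high - low) * critical_ratio


def anchoring_factor(order: float, optimal: float, mean: float) -> float:
    """0 means the order is optimal; 1 means the agent simply ordered mean demand."""
    return (optimal - order) / (optimal - mean)


# Hypothetical scenario: demand uniform on [0, 100], sell at 10, buy at 2.5.
q_star = optimal_order(low=0, high=100, price=10, cost=2.5)
alpha = anchoring_factor(order=60, optimal=q_star, mean=50)
print(q_star, round(alpha, 2))  # 75.0 0.6
```

An agent that orders 60 when the optimum is 75 and mean demand is 50 has closed only 40% of the gap. That is the "mathematically fluent but biased manager" pattern in a single number.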

Three’s Company: When LLMs Argue Their Way to Alpha

TL;DR for operators: Portfolio teams do not need another chatbot that confidently explains why yesterday’s price move was “driven by sentiment.” They need a system that can split research work into specialised roles, force disagreement into the open, log the reasoning trail, and turn messy inputs into a decision that a human can inspect before money moves. ...

Textual Gradients and Workflow Evolution: How AdaptFlow Reinvents Meta-Learning for AI Agents

TL;DR for operators: Most agent teams eventually discover that “the workflow” is not one thing. A customer-support agent, a coding agent, and a mathematical reasoning agent may all use decomposition, verification, consensus, and answer extraction, but not in the same order, not with the same emphasis, and definitely not with the same failure modes. Static agent templates look tidy in architecture diagrams. Then the first heterogeneous workload arrives, and the diagram starts quietly sweating. ...

Cite Before You Write: Agentic RAG That Picks Graph vs. Vector on the Fly

TL;DR for operators: Most enterprise RAG failures are not generation failures. They are retrieval-routing failures wearing a very convincing blazer. The paper behind this article proposes an open-source agentic hybrid RAG framework for scientific literature review: bibliographic metadata and citation relationships go into a Neo4j knowledge graph; full-text PDF chunks go into a FAISS vector store; an LLM-based agent decides whether a user’s question should be answered through GraphRAG or VectorRAG; a Mistral-based generator produces the final answer; DPO is used to improve grounding; and bootstrap resampling is used to report evaluation uncertainty.[1] ...

From Chaos to Choreography: The Future of Agent Workflows

TL;DR for operators: A new survey on agent workflows is not useful because it tells us agents are becoming important. Anyone still surprised by that has probably been trapped in a quarterly innovation committee. Its value is more practical: it turns the messy agent-tool-platform landscape into a comparison map for deciding what kind of workflow infrastructure a business is actually buying or building.[1] ...

Mind the Gap: How Tool Graph Retriever Fixes LLMs’ Missing Links

TL;DR for operators: A user asks an AI agent to delete an account. The obvious tool is DeleteAccount. A normal semantic retriever will probably find it. Splendid. The agent still fails if it misses GetUserToken, because the deletion tool needs a token first. This is the failure mode Tool Graph Retriever, or TGR, is built to address.[1] ...
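The missing-prerequisite problem is easy to picture as a graph walk. The sketch below is not TGR itself: it assumes a hand-written dependency map (`PREREQUISITES`) and a hypothetical helper, `expand_with_prerequisites`, and only shows why a retriever that knows about tool dependencies returns a more usable tool set than one that matches on semantics alone.

```python
# Hypothetical prerequisite map: tool -> tools whose output it needs first.
# Only the two tool names come from the example above; the structure is assumed.
PREREQUISITES = {
    "DeleteAccount": ["GetUserToken"],
    "GetUserToken": [],
}


def expand_with_prerequisites(retrieved, prerequisites):
    """Return the retrieved tools plus transitive prerequisites, dependencies first."""
    ordered = []
    seen = set()

    def visit(tool):
        if tool in seen:  # already handled; also guards against cycles
            return
        seen.add(tool)
        for dependency in prerequisites.get(tool, []):
            visit(dependency)
        ordered.append(tool)

    for tool in retrieved:
        visit(tool)
    return ordered


# Stand-in for what a purely semantic retriever might return for "delete my account".
semantic_hits = ["DeleteAccount"]
print(expand_with_prerequisites(semantic_hits, PREREQUISITES))
# ['GetUserToken', 'DeleteAccount']
```

The semantic hit alone is a dead end; the expanded list is an executable plan. How the dependency edges are discovered in the first place is the hard part, and that is the part this sketch simply assumes.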

From Wallets to Warlords: How AI Agents Are Colonizing Web3

TL;DR for operators: The useful reading of this paper is not “AI agents are coming to crypto.” That is already obvious, and in some corners of the market, painfully over-branded. The sharper point is that Web3-AI agents are forming a stack. At the bottom are infrastructure and trust layers: protocols, DePIN systems, verification mechanisms, execution environments, and agent-development platforms. On top sit the applications: DeFi agents, portfolio tools, market-intelligence systems, governance assistants, security auditors, creative agents, and RWA managers. The paper’s dataset of 133 projects shows this stack is not evenly valued. Infrastructure accounts for 67.8% of the analysed $6.92 billion market capitalisation, even though incubation platforms show the most project activity.[1] ...