AI Agents

Add to Cart, Add to Power: What Happens When AI Shops for You

TL;DR for operators AI shopping agents do not simply “find the best product.” They convert a messy human browsing process into a model-mediated allocation system. That allocation system has its own priors, positional quirks, trust cues, and semantic blind spots. Lovely. We automated the customer and discovered a new customer. The paper introduces ACES, a controlled sandbox for testing AI shopping behaviour. It pairs a browser-use or API-style buying agent with a programmable mock e-commerce site, then randomises product order, prices, ratings, reviews, badges, and product descriptions to estimate what actually moves an AI agent’s choice.[1] ...

Many Minds Make Light Work: Boosting LLM Physics Reasoning via Agentic Verification

TL;DR for operators A familiar enterprise AI failure looks like this: the model gives a confident answer, the formatting is exquisite, the explanation sounds like a gifted teaching assistant, and one equation quietly takes the project into a ditch. Physics is an unusually good place to study that failure because being clear is not enough. The system must interpret the situation, select the right principle, keep the units straight, calculate correctly, and not hallucinate a helpful-but-illegal assumption because the prompt looked lonely. ...

The Roots of Finance: How Reciprocity Explains Credit, Insurance, and Investment

TL;DR for operators Most financial systems are designed as if finance begins with institutions: contracts, lenders, insurers, markets, prices, and enforcement. Paper 2506.00099 asks a cleaner question: what if the core behaviours behind finance emerge before those institutions, from repeated reciprocal interaction?[1] The paper’s central move is to treat trade as the simplest case of reciprocity, then derive credit, insurance, token exchange, and investment as structural extensions of the same mechanism. Add delay, and reciprocity starts to look like credit. Add asymmetric risk, and it starts to look like insurance. Add portable mediation, and it starts to look like token exchange. Add expected future reward, and it starts to look like investment. Finance, in this view, is not born fully dressed in a suit carrying a term sheet. It begins as remembered obligation. ...

The User Is Present: Why Smart Agents Still Don't Get You

TL;DR for operators Most agent demos show the easy part: the model calls a tool, gets results, and returns something plausible. The harder part is less cinematic. The user starts with an incomplete request, reveals constraints in fragments, phrases preferences indirectly, changes emphasis mid-conversation, and expects the system to somehow keep up. This is where many supposedly “smart” agents begin to look less like assistants and more like interns with excellent API access. ...

Tool Up or Tap Out: How Multi-TAG Elevates Math Reasoning with Smarter LLM Workflows

TL;DR for operators Most tool-using LLM workflows still behave like an intern with a favourite spreadsheet: they call one tool, trust the result, and hope the formatting does not catch fire. Multi-TAG proposes a more disciplined pattern. At each reasoning step, the model does not simply choose between chain-of-thought, Python, or WolframAlpha. It asks several tool-backed executors to propose candidate next steps, checks which candidates lead to the same estimated final answer, and then selects the shortest completion among the candidates that agree. That is the useful idea: not “give the model tools,” but “make tools disagree in a controlled way, then use agreement as a verification signal.” ...
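The selection rule described in this teaser can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Candidate` type, its field names, and the `select_step` helper are hypothetical, and treating "the candidates that agree" as the largest group sharing one estimated final answer is an assumption made here.

```python
from collections import defaultdict
from typing import NamedTuple


class Candidate(NamedTuple):
    """One proposed next step from a tool-backed executor (hypothetical shape)."""
    executor: str      # e.g. "cot", "python", "wolfram"
    step: str          # the proposed next reasoning step
    completion: str    # rollout from this step to a final answer
    final_answer: str  # estimated final answer reached by that rollout


def select_step(candidates: list[Candidate]) -> Candidate:
    """Pick the shortest completion among the candidates that agree.

    Assumption: "agree" means belonging to the largest group of
    candidates that share the same estimated final answer.
    """
    if not candidates:
        raise ValueError("need at least one candidate")

    # Group candidates by the final answer their rollout reached.
    groups: dict[str, list[Candidate]] = defaultdict(list)
    for cand in candidates:
        groups[cand.final_answer].append(cand)

    # Agreement across executors acts as the verification signal.
    agreeing = max(groups.values(), key=len)

    # Among agreeing candidates, prefer the shortest completion.
    return min(agreeing, key=lambda c: len(c.completion))


candidates = [
    Candidate("cot", "expand the square", "a long worked derivation", "42"),
    Candidate("python", "evaluate numerically", "short run", "42"),
    Candidate("wolfram", "solve symbolically", "x", "41"),
]
chosen = select_step(candidates)
print(chosen.executor, chosen.final_answer)
```

In this toy run the lone "41" candidate has the shortest completion but is discarded because no other executor reaches the same answer; length only matters inside the agreeing group.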

Forecasting a Smarter Planet: How EarthLink Reimagines Climate Science with Self-Evolving AI Agents

TL;DR for operators Climate work is not short of data. It is short of usable pathways through data. EarthLink, the system studied in this paper, is best understood as an orchestration layer for climate science: it plans analyses, retrieves relevant data, generates code, runs diagnostics, checks results, produces reports, and stores validated query-code-result patterns for reuse.[1] ...

Beyond DNS: Building the Backbone for the Internet of AI Agents

TL;DR for operators If your organisation is building one chatbot, DNS is not your problem. If your organisation expects thousands of autonomous agents to discover one another, verify capabilities, rotate endpoints, respect privacy boundaries, and revoke trust quickly, then DNS starts looking like a filing cabinet in a drone factory. The paper behind NANDA proposes a layered discovery architecture for the “Internet of AI agents”: a lean signed index record called AgentAddr, richer verified metadata called AgentFacts, and optional adaptive resolvers for live endpoint selection.[1] The important idea is not that NANDA is “DNS for agents”. That is the tempting headline and, naturally, the least useful one. The paper is really about separating stable identity from dynamic operational metadata and from runtime routing. ...

Adding Up to Nothing: Coarse Reasoning and the Vanishing St. Petersburg Paradox

TL;DR for operators The paper is not a magic trick that turns an infinite expected value into a finite one. The ordinary St. Petersburg expectation still diverges. Anyone claiming otherwise has either missed the point or found a very ambitious way to lose a philosophy seminar. What the paper actually does is more interesting. Takashi Izumo defines a coarse-grained version of arithmetic in which numbers are first mapped into finite “grains,” each grain is represented by a selected internal value, and addition is performed through repeated projection to those representatives.[1] Under this operation, an increment can become too small to move the current coarse state. That phenomenon is called absorption. Repeated absorption produces inertness: further additions keep arriving, but the represented total stops changing. ...
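A toy version of absorption and inertness is easy to reproduce. The sketch below is an assumption-laden simplification, not Izumo's construction: the grain edges in `EDGES`, the choice of each grain's lower edge as its representative, and the helper names `project`, `coarse_add`, and `coarse_sum` are all invented here for illustration.

```python
from bisect import bisect_right

# Hypothetical grains: [0, 1), [1, 10), [10, 100), [100, 1000), [1000, inf).
# Assumption: each grain is represented by its lower edge.
EDGES = [0, 1, 10, 100, 1000]


def project(x: float) -> int:
    """Map a number to the representative of the grain that contains it."""
    i = bisect_right(EDGES, x) - 1
    return EDGES[max(i, 0)]  # values below the first edge fall into the first grain


def coarse_add(state: int, increment: float) -> int:
    """Add exactly, then project the result back onto a representative."""
    return project(state + increment)


def coarse_sum(terms) -> int:
    """Fold a sequence of increments through repeated projection."""
    state = project(0)
    for term in terms:
        state = coarse_add(state, term)
    return state


terms = [1] * 50                     # fifty equal increments
exact_total = sum(terms)             # ordinary arithmetic keeps growing
coarse_total = coarse_sum(terms)     # the coarse state stalls early

print(exact_total, coarse_total)
print(coarse_add(10, 5))             # increment too small to leave the grain
print(coarse_add(1, 9))              # increment large enough to move the state
```

Here the first increment moves the state from 0 to 1, and every later increment of 1 is absorbed because 1 + 1 still falls inside the grain [1, 10). The exact partial sum keeps growing while the represented total stays put, which is the inertness pattern the teaser describes, in miniature.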

Train of Thought: How Long-Haul RL Unlocks LLM Reasoning Diversity

TL;DR for operators NVIDIA’s paper is not saying “train longer and reasoning magically appears.” That would be comforting, simple, and wrong — a classic enterprise AI trifecta. The practical lesson is more surgical: prolonged reinforcement learning can keep improving a small reasoning model, but only when the training loop actively prevents collapse. The model needs verifiable rewards, diverse tasks, enough rollout diversity, careful clipping, a small KL penalty, reward shaping when behaviour goes off the rails, and periodic resets of both the reference policy and optimiser state. In other words, long-horizon RL behaves less like a single training job and more like operating a live system under stress. ...

Truth, Beauty, Justice, and the Data Scientist’s Dilemma

TL;DR for operators The useful question is not whether AI will “replace data scientists”. That framing is wonderfully dramatic and operationally lazy. Timpone and Yang’s paper, “AI, Humans, and Data Science: Optimizing Roles Across Workflows and the Workforce”, gives a better mechanism: allocate human and AI work by asking what kind of quality each workflow stage needs.[1] Early planning needs creative breadth and problem definition. Execution needs accurate, valid, and ethically defensible data and modelling. Activation needs contextual interpretation, stakeholder judgement, and responsible action. ...