Reinforcement Learning

Forget Me Not: How IterResearch Rebuilt Long-Horizon Thinking for AI Agents

A research workflow usually starts clean. The first search is sensible. The first source is relevant. The first reasoning step looks promising. Then the agent opens five webpages, follows a few tangents, remembers an early mistake too faithfully, and keeps dragging the whole mess forward like a consultant who refuses to delete old slides. By the time the problem actually becomes difficult, the model is no longer short of information. It is drowning in it. ...

When Agents Think in Waves: Diffusion Models for Ad Hoc Teamwork

A warehouse robot does not fail only when it drops the box. Sometimes it fails earlier, in the quieter moment when another robot takes an unexpected route and the first robot keeps behaving as though the original choreography still exists. Nobody crashes. Nothing explodes. The system merely becomes stupid in a very expensive way. ...

Agents on the Clock: How TPS-Bench Exposes the Time Management Problem in AI

A competent assistant can make a list. A useful assistant knows what must happen first. That distinction sounds small until an AI agent is asked to do something ordinary and annoyingly realistic: check a calendar, search the web, compare options, use a map, assemble a recommendation, and perhaps create a document at the end. None of those steps is exotic. The difficulty is that some of them can run in parallel, some must wait for earlier results, and some become nonsense if executed too early. This is less “genius at work” than “junior operations manager with access to too many browser tabs.” Naturally, it is where things get interesting. ...

When the Sandbox Thinks Back: Training AI Agents in Simulated Realities

Workflow software has a deeply unglamorous problem: reality keeps changing. A customer support agent may know the refund policy, but then the customer changes their address, the order record has a missing field, the tool returns a cryptic error, and the next API call requires a schema nobody mentioned in the demo. A spreadsheet agent may know how to summarise a table, but the file path is wrong, the calendar has a conflicting event, and the “obvious” action fails because the world, in its charmingly vindictive way, is not a benchmark prompt. ...

When Markets Dream: The Rise of Agentic AI Traders

Liquidity is boring until it vanishes. Most investors notice market makers only when the screen suddenly looks thin: fewer bids, wider spreads, worse execution, and the faint smell of panic priced into every click. A market maker’s job is not glamorous. It quotes buy and sell prices, earns the spread, manages inventory, and tries not to become the proud owner of too much of the wrong asset at the wrong moment. Finance, as usual, rewards the person who stands calmly in the middle of everyone else’s urgency. ...

Evolving Minds: How LLMs Teach Themselves Through Adversarial Cooperation

Training data is the quiet tax on modern AI. Someone has to write the examples, verify the answers, clean the failures, and pretend the spreadsheet is a strategy. Reinforcement learning makes that tax even more visible: if a model is supposed to improve through feedback, then the organisation must either provide ground-truth answers, hire evaluators, or build verifiers that can tell success from nonsense. ...

Deep Thinking, Dynamic Acting: How DeepAgent Redefines General Reasoning

Tools are where agent demos go to die. The pitch is usually elegant. Give the model a goal, attach a few APIs, let it reason, and watch the automation glide across systems like a tiny consultant with no calendar conflicts. Then the real world appears: too many tools, unclear documentation, stale context, partial failures, long interaction histories, and the occasional API response that seems to have been designed by someone settling a personal score. ...

Blueprints of Agency: Compositional Machines and the New Architecture of Intelligence

A prototype begins innocently enough: a product team wants a small machine, a vehicle, a tool, a fixture, perhaps a mechanism that throws something across a room because medieval engineering apparently never left the group chat. The modern AI pitch says the agent can design it. Give it parts, constraints, and a goal; let it reason; let it test; let it improve. ...

Plan>Then>Profit: Reinforcement Learning That Teaches LLMs to Outline Before They Think

Planning is usually the part of work everybody claims to value and nobody wants to inspect. The deck has a roadmap. The project has a strategy. The model has a chain of thought. Splendid. Now, does the plan actually make the execution better, or is it just theatre with bullet points? That is the useful question behind “Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning,” which introduces PTA-GRPO, a reinforcement-learning method that trains language models to generate an explicit analytic plan before detailed reasoning and then rewards the quality of that plan, not merely the final answer.[1] ...

Paths, Not Parrots: When RL Makes LLMs Plan—and When It Doesn’t

A workflow agent usually looks clever right up to the moment one service is down, one permission changes, or one customer case arrives with the wrong sort of mess attached. Then the question becomes painfully simple: did the model learn a plan, or did it learn the usual route? That distinction is the centre of “Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective,” an ICLR 2026 paper by Siwei Wang, Yifei Shen, Haoran Sun, Shi Feng, Shang-Hua Teng, Li Dong, Yaru Hao, and Wei Chen.[1] The paper is not another victory lap for reinforcement learning. It is more useful than that. It asks what, mechanically, changes when a language model is trained for planning with reinforcement learning rather than supervised fine-tuning. ...