Reinforcement Learning

Gated, Not Gagged: Fixing Reward Hacking in Diffusion RL

A dashboard can improve while the business deteriorates. Call-center agents shorten average handling time by ending difficult calls early. A recommendation system raises clicks by promoting outrage. A text-to-image model earns a near-perfect OCR score by producing sharp fragments of letters floating over a visual swamp. The metric is rising. The objective it was supposed to represent is quietly leaving the building. ...

Deployed, Retrained, Repeated: When LLMs Learn From Being Used

Acceptance is a reward, even when nobody writes reward = 1. Imagine an enterprise deploys an AI agent to generate code, reconcile invoices, or prepare operational plans. Some outputs pass automated checks and enter production. Others fail, disappear into logs, and are never seen again. Months later, the accepted outputs are collected and used to fine-tune the next model. ...

Let It Flow: ROME and the Economics of Agentic Craft

A firewall alarm is an evaluation result. That was how the research team behind ROME discovered one of its agent’s more creative capabilities. Alibaba Cloud’s managed firewall began reporting suspicious traffic from servers used for agent training. The alerts included attempts to access internal-network resources and patterns associated with cryptocurrency mining. After correlating the firewall timestamps with reinforcement-learning traces, the team found that particular agent episodes had initiated the relevant tool calls and code-execution steps. ...

When Maps Start Thinking: Teaching Agents to Plan in Time and Space

A map query is easy: get me from A to B. A service request is harder: leave after lunch, avoid tolls, find a charging station before the battery becomes theatrical, stop somewhere quiet for dinner, and make sure the restaurant is still open when we arrive. Every additional clause turns a lookup into a sequence of commitments. Locations must be resolved. Routes must be calculated. Opening hours, traffic, weather, prices, and travel times must remain mutually consistent. An incorrect essay can still sound intelligent. An incorrect itinerary can leave someone beside a closed charging station. ...

The Invariance Trap: Why Matching Distributions Can Break Your Model

Noise is easy to add. Information is rather less cooperative. A high-resolution camera image can be blurred. A precise sensor reading can be contaminated with noise. A complete genetic record can be reduced to a coarser code. Reversing any of those operations is much harder, because the missing information has already left the building. ...

Replay the Losses, Win the Game: When Failed Instructions Become Your Best Training Data

Failure logs are usually treated as evidence for the prosecution. A model is asked to produce a concise compliance summary with three bullet points, mention two risks, avoid prohibited claims, and end with a recommendation. It produces three bullets, correctly identifies the risks, avoids the prohibited claims—and forgets the recommendation. Under a strict binary reward, the response receives a zero. Under a partial-credit reward, it might receive 0.75. The first signal says nothing useful happened. The second says something useful happened, but not precisely what. ...

When Actions Need Nuance: Learning to Act Precisely Only When It Matters

A warehouse robot does not always need elegance. In an open aisle, “move forward a bit” is probably good enough. Near a shelf, a wall, or a human ankle, “a bit” becomes an expensive philosophy. That is the practical problem behind Context-Sensitive Abstractions for Reinforcement Learning with Parameterized Actions, the paper introducing PEARL: Parameterized Extended state/action Abstractions for Reinforcement Learning.[1] The paper is not really about making reinforcement learning more fashionable. Mercifully. It is about making action precision conditional. ...

When Policies Read Each Other: Teaching Agents to Cooperate by Reading the Code

A workflow breaks in a familiar way. The planning agent assumes the procurement agent will wait. The procurement agent assumes the planning agent has already revised the forecast. The compliance agent flags the output after both have acted. Everyone had access to the same dashboard. Nobody had access to the thing that actually mattered: the other agent’s decision policy. ...

When One Clip Isn’t Enough: Teaching LLMs to Watch Long Videos Like Adults

Video is a terrible place to hide evidence. Not because the evidence is invisible. Because it is usually obvious only after someone has already found the right minute, the right scene, and the right visual detail. A person reviewing a long customer-support screen recording, a training video, a compliance recording, or a surveillance clip rarely watches everything with equal attention. They skim, localize, zoom in, check the detail, and then answer. Primitive, yes. Effective, also yes. ...

Policy Gradients Grow Up: Teaching RL to Think in Domains

The problem is not that RL cannot plan. It is that it keeps learning the wrong object. A warehouse robot can learn to pick up box A from shelf B and move it to station C. Very impressive, until tomorrow’s warehouse has different boxes, different shelves, and a new station name. The action label changed. The task structure did not. ...