Reinforcement Learning

When Rewards Learn Back: Evolution, but With Gradients

Rewards are where many agent projects go to become expensive folklore. A team wants an AI agent to complete long workflows: search, reason, call tools, check constraints, recover from mistakes, and produce a useful answer. The model can talk. The tools work. The benchmark demo is acceptable. Then reinforcement learning enters the room, and someone has to decide what “good” means at every step. ...

When Tokens Become Actions: A Policy Gradient Built for Transformers

Tool calls are not tokens. Neither are paragraphs, reasoning blocks, spreadsheet edits, web searches, code executions, or the awkward little detours an agent takes before finally answering the user. Yet much of reinforcement learning for language models still behaves as if it must choose between two unsatisfying extremes. At one end, every token is treated as a tiny action. At the other, the whole answer is treated as one indivisible action. The first view is mathematically tidy and operationally noisy. The second is practical for verifiable tasks, but it compresses an entire reasoning process into one final score, which is a bit like reviewing an employee only by checking whether the office building is still standing. ...

RL Grows a Third Dimension: Why Text-to-3D Finally Needs Reasoning

A chair is not a picture of a chair. That sounds obvious until a text-to-3D system forgets the backrest from one angle, gives the chair three legs from another, paints the seat correctly, and somehow convinces a weak evaluator that the job is mostly done. In 2D generation, a model can often survive by producing a plausible view. In 3D generation, every view is a witness. Geometry, texture, object parts, and spatial relationships all have to agree. Annoying, yes. Also the entire point. ...

Agents Without Time: When Reinforcement Learning Meets Higher-Order Causality

Handoffs Are Where Fixed Time Sneaks Into Agent Design

Handoffs look harmless. One agent collects evidence, another checks it, a third decides, and a fourth sends the answer to a customer, robot, trader, or dashboard. The workflow diagram has arrows. The arrows have a direction. Someone decided which component acts first. Usually that decision is treated as engineering housekeeping. In Matt Wilson’s paper, it becomes the point of the story.1 ...

Fault, Interrupted: How RIFT Reinvents Reliability for the LLM Hardware Era

A chip does not need to fail everywhere to fail badly

A modern AI accelerator is not fragile in the poetic sense. It is not a porcelain teacup trembling on the edge of a desk. It is much more annoying than that. It can run billions of parameters at high throughput, survive ordinary engineering noise, and still contain a few small fault locations where one carefully placed disturbance can turn a capable model into expensive decorative silicon. The problem is not that every bit matters equally. The problem is that a few bits may matter absurdly more than the rest. ...

Teach Me Once: How One‑Shot LLM Guidance Reshapes Hierarchical Planning

Teach Me Once, Then Please Stop Calling the API

A familiar enterprise automation story starts with a competent but expensive expert in the loop. At first, the expert is useful. They interpret messy instructions, break tasks into sensible stages, and recover when something goes wrong. Then the workflow scales. Suddenly the expert is being called for every transaction, every exception, every tiny decision that could probably have been handled by a trained local process. What began as intelligence becomes latency, cost, and operational dependency. Very elegant. Very billable. Not always very deployable. ...

Clipped, Grouped, and Decoupled: Why RL Fine-Tuning Still Behaves Like a Negotiation With Chaos

Training a reasoning model sounds wonderfully modern until the model discovers that “being correct” and “looking correct enough to satisfy the reward” are not the same career path. That is the quiet problem behind reinforcement learning fine-tuning for large language models. The research conversation often treats methods like PPO, GRPO, and DAPO as a sequence of upgrades: first the classic algorithm, then the critic-free group method, then the decoupled-and-dynamically-sampled variant with a nicer acronym. Very tidy. Unfortunately, models do not read product positioning decks. ...

No Prompt Left Behind: How Shopee’s CompassMax Reinvents RL for Giant MoE Models

Rollouts are expensive little creatures. They consume GPU time, produce long reasoning traces, wait for reward computation, and then—if the reward signal is flat—contribute exactly nothing to learning. The GPU was busy. The training dashboard looked serious. The model learned no usable distinction. Very productive, in the same way a meeting with twelve people and no decision is productive. ...

Prompt, Probe, Persist: How Multi‑Turn RL Is Rewriting the Jailbreak Playbook

A chatbot rarely fails all at once. In production, failure is usually more boring than cinema. A user asks something borderline. The model refuses. The user rephrases. The model gives a harmless explanation. The user narrows the topic. The model follows the conversation. Then, several turns later, the assistant provides content it should not have provided. No thunder. No villain monologue. Just an interaction history doing what interaction histories do: accumulating context. ...

Worlds Within Reach: How SIMA 2 Turns Virtual Environments into Training Grounds for Generalist Agents

Games are not toys to an AI lab. They are controlled worlds with messy consequences. A game gives an agent what enterprise software and robotics both struggle to provide at scale: visual ambiguity, delayed goals, menus, navigation, tool use, failure states, and a reset button that does not involve a broken warehouse robot or a furious operations manager. That is why Google DeepMind’s SIMA 2 paper is more interesting than “AI can play games again.” We have had that headline several times. It is getting a little tired, and it should probably hydrate. ...