Reinforcement Learning

Agents That Learn From Their Own Mistakes: The Rise of Retroactive AI

Mistakes are useful only when they are converted into something operational. That is the small, inconvenient detail often missing from agent hype. An LLM agent can fail at a web-shopping task, wander through a simulated room, push the wrong Sokoban box, or uncover the wrong MineSweeper cell. Fine. Failure happens. The useful question is not whether the agent failed. The useful question is whether the system can extract a reusable signal from that failure before the next attempt. ...

Mirror, Mirror on the Agent: Teaching LLMs to Judge Their Own Actions

The agent did exactly what it was taught. That was the problem. A familiar business agent failure does not look dramatic. It looks boring. The agent searches the database, clicks the wrong record, receives an error, retries the same action, receives the same error, retries again, and then politely informs the user that it has encountered “temporary difficulty.” Very professional. Completely useless. ...

The Long Conversation Problem: How MAPO Teaches AI to Care Over Time

Customer support has a familiar failure mode: the first answer sounds polished, the second answer sounds patient, the third answer sounds as if the system has quietly forgotten what problem it is solving. The user is still there. The emotional state has changed. The unresolved issue has shifted. The model, meanwhile, keeps producing individually acceptable replies, like a waiter bringing one beautifully plated dish at a time to the wrong table. ...

Teaching Reinforcement Learning to Think Before It Acts

Agents are easy to impress and hard to trust. Give a reinforcement learning agent a game, a reward signal, and enough time, and it may discover something brilliant. Or it may discover the dumbest possible way to look successful. In Seaquest, that can mean shooting enemies while ignoring oxygen. In Kangaroo, it can mean punching enemies in a corner instead of climbing toward the joey. Technically, points go up. Strategically, the agent has learned the machine-learning equivalent of optimizing a dashboard while the business burns quietly in the background. ...

When the Streets Flood, Let the AI Drive: Reinforcement Learning for Climate‑Resilient Cities

A flooded street is not only a drainage problem. It is a transport problem, a budget problem, an insurance problem, a public-trust problem, and, if the city waits long enough, a very expensive lesson in pretending that yesterday’s weather statistics are still a planning manual. Copenhagen is a useful place to begin because the paper’s case is not imaginary. In 2011, the city experienced a major cloudburst that flooded streets, disrupted roads and rail, and caused damage estimated at around 6 billion Danish kroner. The new research paper, Artificial Intelligence for Climate Adaptation: Using Reinforcement Learning for Climate Change-Resilient Transport, uses Copenhagen’s inner city as the testbed for a larger question: how should a city decide where, when, and how much to invest in flood adaptation between 2024 and 2100? ...

Bending the Beam, Not the Brain: What RL with Perfect Rewards Still Can’t Teach LLMs

Beams are honest objects. Push them, load them, move their supports, and they obey equilibrium equations without theatrical ambiguity. Language models, unfortunately, are less well-behaved. That is what makes BeamPERL a useful paper. It does not test LLM reasoning on a vague benchmark where “correctness” means pleasing a judge, matching a rubric, or sounding sufficiently graduate-school. It asks a compact reasoning model to solve a classical beam statics task: calculate support reactions for a loaded beam. The answers can be checked by a symbolic solver. The reward can be exact. No vibes, no partial credit, no “the answer feels plausible.” ...
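To make "the reward can be exact" concrete, here is a minimal sketch for the simplest possible case: a simply supported beam carrying one downward point load. It is an illustration only, not the BeamPERL task set or its checker. The function names support_reactions and exact_reward are hypothetical, and Python's exact rational arithmetic stands in for the symbolic solver the paper describes.

```python
from fractions import Fraction


def support_reactions(length, load, position):
    """Reactions (R_A, R_B) for a simply supported beam with one point load.

    Assumed setup: supports at x = 0 and x = length, a downward load
    of magnitude `load` acting at x = position.
    """
    length, load, position = Fraction(length), Fraction(load), Fraction(position)
    r_b = load * position / length  # moment balance about the left support
    r_a = load - r_b                # vertical force balance
    return r_a, r_b


def exact_reward(predicted, length, load, position):
    """Return 1 only if both predicted reactions equal the reference, else 0."""
    expected = support_reactions(length, load, position)
    return 1 if tuple(Fraction(p) for p in predicted) == expected else 0


# A 10 m beam with a 100 N load placed 4 m from the left support.
print(support_reactions(10, 100, 4))
print(exact_reward((60, 40), 10, 100, 4))  # equilibrium satisfied
print(exact_reward((50, 50), 10, 100, 4))  # plausible-looking but wrong
```

The second guess splits the load evenly, which sounds reasonable and still scores zero. That is the point of a checkable reward: there is nothing for a fluent answer to hide behind.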

Dare to Benchmark: Why Data Science Agents Still Trip Over Their Own Pipelines

Spreadsheet work has a special kind of comedy. A person asks an AI agent to load a dataset, clean a few columns, train a model, generate predictions, and save a prediction.csv file. The agent writes plausible Python. The model architecture is reasonable. The explanation sounds confident. Then the whole thing fails because the agent forgot to pass the filename into the execution tool. ...

When Buffers Bite Back: Teaching AI to Respect Pallets in Flexible Job Shops

Factories rarely fail because a machine cannot work. They fail because the machine, the operator, the part, the fixture, the pallet, and the next free square meter of floor space refuse to arrive in the same universe at the same time. That is why a scheduling paper about pallets is more interesting than it sounds. ...

When Failure Pays Dividends: Recycling Reasoning in RLVR with SCOPE

Failure logs are usually where AI teams put the evidence that training was expensive. A reasoning model tries a problem. It gets most of the chain right. Then, near the end, it makes one bad algebraic turn, chooses the wrong case, or quietly invents a rule that mathematics did not approve. Under standard reinforcement learning from verifiable rewards, that rollout receives the same score as nonsense: zero. The model may have climbed nine floors and tripped on the final step; the reward system marks it as indistinguishable from someone who never entered the building. ...
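The scoring problem described above can be sketched in a few lines. This is a toy illustration of an outcome-only verifiable reward, not SCOPE's method and not any particular training library's API. The rollout fields and the verifiable_reward helper are hypothetical names invented for the example.

```python
def verifiable_reward(final_answer: str, reference: str) -> float:
    """Outcome-only reward: 1.0 for an exact match on the final answer, else 0.0."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


reference = "42"

# Two hypothetical rollouts: one with nine of ten steps correct, one with none.
rollouts = {
    "nearly_right": {"steps_correct": 9, "steps_total": 10, "final_answer": "41"},
    "nonsense": {"steps_correct": 0, "steps_total": 10, "final_answer": "banana"},
}

rewards = {
    name: verifiable_reward(r["final_answer"], reference)
    for name, r in rollouts.items()
}

for name, r in rollouts.items():
    print(name, f'{r["steps_correct"]}/{r["steps_total"]} steps correct', rewards[name])
```

Both rollouts receive 0.0, so the nine correct steps contribute nothing to the learning signal. The reward function never looks at the step counts, which is exactly the waste the excerpt describes.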

Mind the Gap: Why Agency Isn’t Intelligence (Yet)

A trading bot keeps executing while the market regime changes. A warehouse robot keeps optimizing its route while a sensor slowly drifts. A customer-service agent keeps sounding fluent while the conversation loses coherence one turn at a time. From the outside, the system still looks agentic. It acts. It responds. It may even keep producing acceptable short-term outcomes. The dashboard, naturally, waits until the mess is obvious. Dashboards are polite like that. ...