Reinforcement Learning

Flip the Switch: How Heterogeneous Agents Learn to Restore the Grid

A power outage is not one problem. It is a queue of smaller, uglier problems pretending to be one. Which switches can be closed? Which loads should come back first? Which distributed generators are available? Which lines will overheat if a local microgrid gets too ambitious? Which voltage limits will quietly make the elegant restoration plan unusable? In a control room, these questions arrive together, under time pressure, with the usual helpful accompaniment of incomplete information and operational consequences. ...

Mind the Gap: When Robots Learn Social Norms the Human Way

A hotel robot does not need to understand the human soul. It does, however, need to stop cutting between two guests mid-conversation like an intern late for coffee. That distinction matters. Most enterprise conversations about autonomous agents still treat navigation as a logistics problem: reach the destination, avoid collision, minimise delay. Very tidy. Very spreadsheet. Also incomplete. In public-facing environments, a robot can be technically safe and still socially unpleasant. It can avoid hitting people while still making them step back, tense up, or wonder why the expensive machine has the spatial awareness of a supermarket trolley. ...

Reasoning on Mars: How Pipeline-Parallel RL Rewires Multi‑Agent Intelligence

Review is cheap until it has to be correct. That is the uncomfortable lesson behind many agentic AI demos. A system writes an answer. A second model checks it. A third model fixes it. The workflow looks reassuringly managerial, like a tiny consulting firm trapped inside a GPU cluster. But the appearance of oversight is not the same thing as oversight. A weak reviewer can punish a good answer. A weak fixer can damage a nearly correct answer. And if the whole chain receives one final reward, reinforcement learning may end up congratulating the wrong participant. Very corporate, really. ...

Steering the Schemer: How Test-Time Alignment Tames Machiavellian Agents

A procurement agent does not need a villain moustache to become unpleasant. Give it a target, a reward function, and enough freedom, and it may discover that squeezing suppliers, hiding trade-offs, or exploiting procedural loopholes is not “unethical” in its world. It is just efficient. That is the point of the MACHIAVELLI benchmark, and also the reason the paper Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping is worth reading carefully. The paper is not selling a new moral soul for AI agents. Thankfully. We have enough vendors selling souls already. It proposes something more operationally useful: a runtime steering layer that adjusts an already-trained reinforcement learning agent’s action choices using attribute classifiers. ...

Think Outside the Bounding Box: How SpatialThinker Reinforces 3D Reasoning

A warehouse robot does not need poetry. It needs to know whether the box is behind the pallet, whether the cup is closer than the plate, and whether the object it is about to grab is actually reachable rather than merely visible. Small details. Very irritating when ignored. This is where many multimodal models still become strangely philosophical. They can describe an image fluently, infer intent, and produce a confident answer. Then they miss that one object is in front of another. Apparently, “seeing” and understanding space are not the same occupation. ...

When Videos Grow Hands: How PhysWorld Teaches Robots to Stop Hallucinating Physics

Robots are not impressed by nice videos. A generated clip can show a hand placing a book into a shelf, pouring tomatoes from a pan, or sweeping scraps into a dustpan. It can look coherent enough to fool a casual viewer and perhaps even a product demo audience, which is not exactly the highest bar in technology. But a robot does not execute “looks coherent.” It executes poses, contacts, forces, trajectories, collisions, and failures. ...

Graph Minds, Game Moves: How Multi‑Agent Learning Is Quietly Redrawing AI Strategy

A traffic light is not just a traffic light once the other lights start learning. That is the uncomfortable starting point for strategic AI systems. A single model can optimise a route, price, recommendation, allocation, or control policy. But the moment other decision-makers are learning at the same time, the environment stops behaving like scenery. It becomes a cast. Each actor updates, reacts, misreads, cooperates, defects, imitates, or quietly ruins the assumptions in your simulator. Very rude, but entirely realistic. ...

Play by Automata: How Regular Games Rewrites the Rules of General Game Playing

A game engine is usually where rules go to become software. Someone writes the rules, someone else encodes the rules, and an AI agent then spends its expensive little life asking the engine what moves are legal, what happens next, and whether it has already lost. Very glamorous. Very repetitive. General Game Playing tries to remove the hand-built engine from that loop. Instead of building a custom simulator for chess, backgammon, Amazons, Reversi, or some procedural oddity invented on a tired Wednesday afternoon, a game is described in a formal language and a generic system turns that description into something agents can use. ...

Don’t Self-Sabotage Me Now: Rational Policy Gradients for Sane Multi-Agent Learning

Kitchen work is not hard because chopping onions is metaphysically difficult. It is hard because two people must agree, implicitly and quickly, who gets the onion, who holds the plate, who waits by the pot, and who moves out of the corridor before everyone performs a small culinary traffic accident. That is why Overcooked remains such a useful multi-agent benchmark. It turns coordination into something visible. Agents do not merely need to “perform a task”; they need to infer what another agent is about to do and avoid becoming a sentient obstacle. ...

Proof, Policy, and Probability: How DeepProofLog Rewrites the Rules of Reasoning

Proofs are supposed to be the respectable part of AI: tidy, inspectable, and resistant to the usual neural-network fog machine. Then reality turns up, as it so often does, carrying a bill. In neurosymbolic AI, the bill is search. A system may know the rules. It may even combine them with neural perception. But if answering a query requires enumerating a vast space of possible proofs, the promise of “interpretable reasoning” quickly becomes a very elegant way to run out of time. ...