Reinforcement Learning

When VR Shooters Meet Discrete Events: Training Security Policies Without Endless Human Trials

Training a security policy sounds simple until the training data involves people role-playing traumatic emergencies inside a virtual school. That is the uncomfortable starting point of this paper. Virtual reality can help researchers study rare and dangerous events under controlled conditions, but it does not solve the scaling problem. Every new intervention, policy variation, or robot behavior still needs another human-subject experiment. That is slow, expensive, ethically constrained, and not exactly a cheerful afternoon in the lab. ...

Search-R2: When Retrieval Learns to Admit It Was Wrong

Search is supposed to make language models safer. The model does not know something, so it searches. It finds evidence, reasons over that evidence, and gives a better answer. Very civilized. Very responsible. Then the first search query goes slightly wrong. The model retrieves a relevant-looking but misleading paragraph. It builds the next reasoning step around the wrong entity. The next query becomes narrower, but in the wrong direction. The final answer may still sound fluent, because fluency is the one department where language models rarely file sick leave. The actual reasoning chain, however, has already drifted. ...

When Agents Stop Talking to the Wrong People

Communication sounds harmless until the wrong person gets the microphone. That is true in meetings. It is also true in multi-agent AI systems. The polite version says agents “collaborate,” “debate,” and “refine each other’s reasoning.” The less decorative version is that one agent’s output becomes another agent’s input. If the first agent is wrong, confused, strategically misleading, or simply having one of those tiny synthetic breakdowns that LLMs have with impressive confidence, the system has just created a distribution channel for bad judgment. ...

Coaching the Swarm: Why Multi‑Agent RL Finally Scales

Blame is the unglamorous foundation of automation. When a human team misses a deadline, managers rarely ask only, “Did the project succeed?” They ask a more useful question: which handoff failed? Did the analyst misunderstand the data? Did engineering break the pipeline? Did the reviewer approve a bad output because the earlier work looked plausible? This is the difference between evaluation and coaching. Evaluation produces a score. Coaching produces a diagnosis. ...

ThinkSafe: Teaching Models to Refuse Without Forgetting How to Think

A model can be very good at solving math problems and very bad at saying no. That sentence sounds like a joke until it becomes a deployment problem. A reasoning model trained to work harder, think longer, and satisfy difficult prompts may also become more willing to satisfy harmful prompts. The training objective says: solve the problem. The model obeys. Safety, apparently, was not copied on the memo. ...

Grading the Doctor: How Health-SCORE Scales Judgment in Medical AI

Checklist is a boring word. That is why it is useful. In healthcare AI, the glamorous question is whether a model can “reason like a doctor.” The operational question is uglier: did it invent a lab value, miss an emergency referral, overstate certainty, ignore the requested format, recommend unsafe antibiotics, or fail to ask for missing context? ...

MemCtrl: Teaching Small Models What Not to Remember

A robot assistant walks through a room. It sees a chair from the front. Then from the side. Then from a slightly worse angle. Then the same chair again, because the camera moved while the robot hesitated. In theory, all of this is “context.” In practice, it is mostly noise wearing a productivity badge. ...

When Rewards Learn to Think: Teaching Agents How They’re Wrong

An agent fails a task. It searched the web twice, opened the wrong page, trusted a noisy snippet, wrote a plausible final answer, and lost the point. Traditional reinforcement learning sees one thing: wrong. That is brutally clean, and also rather unhelpful. The agent may have performed three useful steps before collapsing at the fourth. Or it may have wandered confidently through nonsense from the beginning. Sparse final-answer rewards flatten these cases into the same training signal. The scoreboard says “0.” Very educational, in the same way a fire alarm teaches architecture. ...

Learning to Discover at Test Time: When Search Learns Back

A leaderboard usually treats an AI model like a very fast intern: give it a problem, let it try many times, keep the best answer, and politely ignore the graveyard of failed attempts. That is useful. It is also a little strange. A human engineer does not merely try 25,600 variations of a GPU kernel while keeping the same brain. After the first few failures, she learns which bottlenecks matter. After a lucky partial success, she changes how she thinks about the problem. After enough attempts, the search process is no longer just sampling. It has become learning. ...

When LLMs Get a Laptop: Why Sandboxes Might Be the Real AGI Benchmark

Laptop. That is the deceptively simple object hiding inside this paper. Not a magic planner. Not a thousand-tool agent marketplace. Not a baroque workflow with seventeen orchestration layers and a dashboard that looks like a cockpit designed by consultants. A laptop. Or, more precisely, a minimal virtual computer: a sandbox with terminal access, file editing, code execution, persistent files, and the ability to install or fetch resources. In Computer Environments Elicit General Agentic Intelligence in LLMs, Cheng et al. ask a question that looks almost too obvious to be interesting until one remembers how much of the AI industry is still trying to squeeze “agency” out of longer prompts. ...