Reinforcement Learning

When Benchmarks Rot: Why Static ‘Gold Labels’ Are a Clinical Liability

Clinical AI has a paperwork problem. Not the usual paperwork problem, where doctors drown in documentation and everyone promises that software will save them. The more interesting problem sits one layer below: the paperwork used to judge the software may itself be wrong. That is the uncomfortable center of Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight, a paper that audits MedCalc-Bench, a benchmark for testing whether language models can compute medical risk scores from patient narratives.1 The paper’s target is not a toy dataset. MedCalc-Bench covers 55 medical calculators and includes 10,053 training instances plus 1,047 test instances. Its labels were produced through an LLM-assisted pipeline: GPT-3.5 matched patient contexts to calculator questions, GPT-4 extracted clinical features, and Python scripts aggregated those features into final scores. ...

About Time: When Reinforcement Learning Finally Learns to Wait

Waiting is a decision. That sounds obvious to anyone who has watched a warehouse robot pause at an intersection, a trading system delay execution, or an autonomous vehicle slow down before a pedestrian crossing. In the real world, “do the task” is rarely the whole instruction. The operational instruction is closer to: do the task, in this order, not before this condition, not after that deadline, and preferably without wasting time while pretending that nothing is happening. ...

Same Moves, Different Minds: Rashomon Comes to Sequential Decision-Making

A taxi is a useful little trap. It looks harmless: pick up passengers, drive them to destinations, do not run out of fuel. A small grid-world taxi environment is not exactly the sort of thing that makes executives whisper “agentic transformation” over terrible conference coffee. But that is precisely why it works. Strip away the enterprise theatre, and sequential decision-making becomes easier to see. An agent observes a state, chooses an action, receives the next state, and repeats. If two agents always make the same moves and achieve the same objective, most organizations would treat them as equivalent. Same behavior, same operational meaning. Audit passed. Ship it. ...

Darwin, But Make It Neural: When Networks Learn to Mutate Themselves

A system breaks after a rule changes. The recommendation model suddenly faces a new product catalog. The warehouse routing policy meets a new constraint. A trading bot trained in one market regime walks into another and immediately discovers that yesterday’s “smart behavior” is today’s elegant way to lose money. The usual engineering instinct is to retrain, retune, or ask a human to adjust the knobs. Very modern. Very expensive. Very Tuesday. ...

When Rewards Learn to See: Teaching Humanoids What the Ground Looks Like

Robots do not fall because the word “walk” is ambiguous. They fall because the ground has opinions. A flat floor, a gap, a pile of blocks, and a staircase may all ask for “locomotion,” but they do not ask for the same behavior. One asks for velocity tracking. Another asks for foot placement. Another punishes careless exploration. A staircase, because it has a flair for drama, asks the robot to negotiate gravity one step at a time. ...

Stop or Strip? Teaching Disassembly When to Quit

A battery pack arrives at an end-of-life processing facility. The easy story says the operator should recover as much value as possible while doing the sustainable thing. The harder story starts five minutes later, when someone has to decide whether to stop, reuse the pack, remove the cover, strip the thermal shield, extract a module, test it, recycle it, or finally admit defeat and dispose of what remains. ...

Adversaries, Slices, and the Art of Teaching LLMs to Think

A math tutor does not wait until the end of a two-page solution, circle the final answer, and say “wrong.” At least, not a good one. The useful tutor interrupts earlier. This line follows. That parity condition does not. This factorization is legal, but the conclusion you drew from it is not. The feedback is local, not theatrical. It tells the student where the reasoning began to rot, before the final answer becomes merely the visible corpse. ...

Stepwise Think-Critique: Teaching LLMs to Doubt Themselves (Productively)

The useful part of doubt is timing. Doubt is not useful after the invoice is paid, the client report is sent, or the model has already produced a confident wrong answer with twelve decorative paragraphs of reasoning. At that point, “let us verify” becomes less like quality control and more like archaeology. ...

Picking Less to Know More: When RAG Stops Ranking and Starts Thinking

Search is not judgment. Search is easy to admire because it produces something visible. A ranked list. A bigger context window. A satisfying pile of passages that says, “Look, we retrieved evidence.” Very comforting. Also not the same as knowing what evidence is actually needed. That distinction is the core of Context-Picker: Dynamic Context Selection Using Multi-stage Reinforcement Learning.1 The paper studies a familiar RAG problem: if a system retrieves too little, it misses the answer; if it retrieves too much, it drags in distractors, repeats, weakly related fragments, and the usual long-context swamp where useful evidence politely disappears in the middle. ...

When Reasoning Needs Receipts: Graphs Over Guesswork in Medical AI

Diagnosis is not a magic word. In medicine, the answer matters, but the path to the answer matters almost as much. A model that says the correct disease name after skipping the decisive evidence is not “reasoning efficiently.” It is guessing with bedside manner. That is the problem addressed by MedCEG: Reinforcing Verifiable Medical Reasoning with Critical Evidence Graph.1 The paper’s core claim is not simply that a medical LLM can score higher on benchmarks. That would be useful, but not especially surprising. The more interesting move is architectural: the authors try to make clinical reasoning trainable by turning it into a graph of required evidence, then rewarding the model for following that graph. ...