Replay the Losses, Win the Game: When Failed Instructions Become Your Best Training Data

Failure logs are usually treated as evidence for the prosecution. A model is asked to produce a concise compliance summary with three bullet points, mention two risks, avoid prohibited claims, and end with a recommendation. It produces three bullets, correctly identifies the risks, avoids the prohibited claims—and forgets the recommendation. Under a strict binary reward, the response receives a zero. Under a partial-credit reward, it might receive 0.75. The first signal says nothing useful happened. The second says something useful happened, but not precisely what. ...
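The arithmetic in that example can be sketched in a few lines. This is a minimal illustration, not the method of the post or of any paper: the constraint names, the equal weighting of the four checks, and the helper functions are all hypothetical.

```python
# Hypothetical per-constraint results for the compliance-summary example.
# The names and the equal weighting are illustrative assumptions.
checks = {
    "three_bullets": True,
    "two_risks_mentioned": True,
    "no_prohibited_claims": True,
    "ends_with_recommendation": False,
}


def binary_reward(checks):
    """Strict reward: 1.0 only if every constraint is satisfied."""
    return 1.0 if all(checks.values()) else 0.0


def partial_reward(checks):
    """Partial credit: fraction of constraints satisfied."""
    return sum(checks.values()) / len(checks)


def failed_constraints(checks):
    """Names of the constraints that were not satisfied."""
    return [name for name, ok in checks.items() if not ok]


print(binary_reward(checks))       # 0.0
print(partial_reward(checks))      # 0.75
print(failed_constraints(checks))  # ['ends_with_recommendation']
```

Neither scalar says which constraint failed. The third function shows the kind of per-constraint record a failure log would need to keep for that information to be recoverable.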

December 30, 2025 · 18 min · Zelina

When Actions Need Nuance: Learning to Act Precisely Only When It Matters

A warehouse robot does not always need elegance. In an open aisle, “move forward a bit” is probably good enough. Near a shelf, a wall, or a human ankle, “a bit” becomes an expensive philosophy. That is the practical problem behind Context-Sensitive Abstractions for Reinforcement Learning with Parameterized Actions, the paper introducing PEARL: Parameterized Extended state/action Abstractions for Reinforcement Learning. The paper is not really about making reinforcement learning more fashionable. Mercifully. It is about making action precision conditional. ...

December 28, 2025 · 14 min · Zelina

When Policies Read Each Other: Teaching Agents to Cooperate by Reading the Code

A workflow breaks in a familiar way. The planning agent assumes the procurement agent will wait. The procurement agent assumes the planning agent has already revised the forecast. The compliance agent flags the output after both have acted. Everyone had access to the same dashboard. Nobody had access to the thing that actually mattered: the other agent’s decision policy. ...

December 26, 2025 · 19 min · Zelina

When One Clip Isn’t Enough: Teaching LLMs to Watch Long Videos Like Adults

Video is a terrible place to hide evidence. Not because the evidence is invisible. Because it is usually obvious only after someone has already found the right minute, the right scene, and the right visual detail. A person reviewing a long customer-support screen recording, a training video, a compliance recording, or a surveillance clip rarely watches everything with equal attention. They skim, localize, zoom in, check the detail, and then answer. Primitive, yes. Effective, also yes. ...

December 24, 2025 · 15 min · Zelina

Policy Gradients Grow Up: Teaching RL to Think in Domains

The problem is not that RL cannot plan. It is that it keeps learning the wrong object. A warehouse robot can learn to pick up box A from shelf B and move it to station C. Very impressive, until tomorrow’s warehouse has different boxes, different shelves, and a new station name. The action label changed. The task structure did not. ...

December 23, 2025 · 18 min · Zelina

When Benchmarks Rot: Why Static ‘Gold Labels’ Are a Clinical Liability

Clinical AI has a paperwork problem. Not the usual paperwork problem, where doctors drown in documentation and everyone promises that software will save them. The more interesting problem sits one layer below: the paperwork used to judge the software may itself be wrong. That is the uncomfortable center of Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight, a paper that audits MedCalc-Bench, a benchmark for testing whether language models can compute medical risk scores from patient narratives. The paper’s target is not a toy dataset. MedCalc-Bench covers 55 medical calculators and includes 10,053 training instances plus 1,047 test instances. Its labels were produced through an LLM-assisted pipeline: GPT-3.5 matched patient contexts to calculator questions, GPT-4 extracted clinical features, and Python scripts aggregated those features into final scores. ...

December 23, 2025 · 15 min · Zelina

About Time: When Reinforcement Learning Finally Learns to Wait

Waiting is a decision. That sounds obvious to anyone who has watched a warehouse robot pause at an intersection, a trading system delay execution, or an autonomous vehicle slow down before a pedestrian crossing. In the real world, “do the task” is rarely the whole instruction. The operational instruction is closer to: do the task, in this order, not before this condition, not after that deadline, and preferably without wasting time while pretending that nothing is happening. ...

December 22, 2025 · 16 min · Zelina

Same Moves, Different Minds: Rashomon Comes to Sequential Decision-Making

A taxi is a useful little trap. It looks harmless: pick up passengers, drive them to destinations, do not run out of fuel. A small grid-world taxi environment is not exactly the sort of thing that makes executives whisper “agentic transformation” over terrible conference coffee. But that is precisely why it works. Strip away the enterprise theatre, and sequential decision-making becomes easier to see. An agent observes a state, chooses an action, receives the next state, and repeats. If two agents always make the same moves and achieve the same objective, most organizations would treat them as equivalent. Same behavior, same operational meaning. Audit passed. Ship it. ...

December 22, 2025 · 18 min · Zelina

Darwin, But Make It Neural: When Networks Learn to Mutate Themselves

A system breaks after a rule changes. The recommendation model suddenly faces a new product catalog. The warehouse routing policy meets a new constraint. A trading bot trained in one market regime walks into another and immediately discovers that yesterday’s “smart behavior” is today’s elegant way to lose money. The usual engineering instinct is to retrain, retune, or ask a human to adjust the knobs. Very modern. Very expensive. Very Tuesday. ...

December 21, 2025 · 17 min · Zelina

When Rewards Learn to See: Teaching Humanoids What the Ground Looks Like

Robots do not fall because the word “walk” is ambiguous. They fall because the ground has opinions. A flat floor, a gap, a pile of blocks, and a staircase may all ask for “locomotion,” but they do not ask for the same behavior. One asks for velocity tracking. Another asks for foot placement. Another punishes careless exploration. A staircase, because it has a flair for drama, asks the robot to negotiate gravity one step at a time. ...

December 21, 2025 · 14 min · Zelina