Robotics

Scalpel Meets Silicon: The Rise of Surgical Foundation Models

Operating rooms do not lack data. They lack data that behaves. A surgical video is not merely a moving picture of tissue, tools, and occasional smoke. It is a compressed record of anatomy, timing, judgment, motor control, institutional habit, and, when things go wrong, irreversible consequence. That makes surgery a deeply inconvenient domain for AI. Standard computer vision likes objects. Surgery gives it interactions. Standard multimodal models like captions. Surgery asks whether the cystic duct is safely exposed before clipping. Lovely. ...

Teaching Reinforcement Learning to Think Before It Acts

Agents are easy to impress and hard to trust. Give a reinforcement learning agent a game, a reward signal, and enough time, and it may discover something brilliant. Or it may discover the dumbest possible way to look successful. In Seaquest, that can mean shooting enemies while ignoring oxygen. In Kangaroo, it can mean punching enemies in a corner instead of climbing toward the joey. Technically, points go up. Strategically, the agent has learned the machine-learning equivalent of optimizing a dashboard while the business burns quietly in the background. ...

When Plans Break: Relaxing Petri Nets for Smarter Sequential Planning

Plans fail in painfully ordinary ways. A warehouse robot cannot both reserve the last pallet slot and keep the aisle clear. A field-service schedule cannot satisfy every customer window after one technician calls in sick. A compliance workflow cannot approve a transaction before the missing document exists, no matter how passionately the dashboard insists on “urgent priority.” ...

Reasoning Is Optional. Optimization Is Not: Rethinking VLA Training with NORD

Driving teams do not pay for reasoning tokens because they enjoy watching a model narrate its inner life. They pay for them because, at least in current VLA training culture, reasoning traces are treated as a bridge between perception and action. The bridge is expensive. A typical reasoning-heavy Vision-Language-Action pipeline for autonomous driving collects large driving datasets, generates dense chain-of-thought-style annotations, supervised-fine-tunes the model, and then applies reinforcement learning to improve driving metrics. It is a respectable pipeline. It is also the kind of pipeline that quietly converts every research win into an invoice. ...

Diffusing to Coordinate: When Multi-Agent RL Learns to Breathe

Robots are easy to imagine as individuals. A quadruped walks. A drone flies. A warehouse arm picks. The business slide is usually kind enough to show one machine, one task, one satisfying arrow from input to output. Reality is less polite. A quadruped is not one decision-maker. It is a committee of limbs negotiating with gravity. A multi-drone system is not one policy with four propellers. It is a moving argument about timing, local perception, shared goals, and what not to crash into. A factory cell with multiple robotic agents is even worse: every local action changes the environment other agents are trying to understand. ...

When Robots Disagree: Taming Gradient Conflicts in Cross-Embodiment Offline RL

A robot fleet looks efficient on a spreadsheet. One warehouse robot logs a few million movements. Another quadruped logs a few million more. A bipedal platform contributes its own dataset. The obvious managerial instinct is to pour everything into one large training pool and let scale do its polite little miracle. This is where robots become less cooperative than cloud software. ...

From Guesswork to Generative Foresight: Why Diffusion Models May Fix Multi-Agent Blind Spots

A warehouse robot turns a corner and sees three things: a shelf edge, a moving cart, and another robot’s partial path. It does not see the blocked aisle behind the shelf. It does not see whether the cart will stop or continue. It does not see the supervisor system’s full map. Still, it must act. ...

Stable World Models, Unstable Benchmarks: Why Infrastructure Is the Real Bottleneck

A robot does not fail politely. It does not say, “I was trained on a slightly different shade of blue.” It just misses the object, pushes the wrong way, or confidently follows a plan that only works in the tidy little universe where the benchmark was born. That is the uncomfortable lesson behind stable-worldmodel-v1, a paper that is less about inventing a new world model and more about asking whether world-model research has been measuring the right thing in the first place. ...

Benchmarks Lie, Rooms Don’t: Why Embodied AI Fails the Moment It Enters Your House

The room is not impressed by your leaderboard. A robot that performs well on a public benchmark has not necessarily learned how to operate in your house. It may recognize a chair in a dataset. It may answer a visual question about a tidy image. It may even produce a confident paragraph explaining where the coffee mug should be. Then it enters a real room (with mirrors, partial views, cluttered corners, awkward sightlines, and objects that are not positioned for benchmark convenience) and suddenly the “general intelligence” starts behaving like a tourist holding the map upside down. ...

When VR Shooters Meet Discrete Events: Training Security Policies Without Endless Human Trials

Training a security policy sounds simple until the training data involves people role-playing traumatic emergencies inside a virtual school. That is the uncomfortable starting point of this paper. Virtual reality can help researchers study rare and dangerous events under controlled conditions, but it does not solve the scaling problem. Every new intervention, policy variation, or robot behavior still needs another human-subject experiment. That is slow, expensive, ethically constrained, and not exactly a cheerful afternoon in the lab. ...