Learning by X-ray: When Surgical Robots Teach Themselves to See in Shadows

X-rays are useful because they are cheap, familiar, and already sitting in the operating room. They are also, inconveniently, shadows.

That is the central tension in “Investigating Robot Control Policy Learning for Autonomous X-ray-guided Spine Procedures”, a paper that asks whether a robot policy can plan vertebroplasty cannula trajectories from bi-planar X-ray views alone (one anteroposterior, or AP, view and one lateral, or LAT, view), without CT-based navigation, registration, or a lovingly over-engineered suite of intra-operative infrastructure.¹

The headline number is tempting: the policy achieved 68.5% acceptable first-pass trajectories on held-out synthetic cases. That is not nothing. In surgical robotics, getting a learned policy to infer a plausible 3D path through the spine from sparse 2D radiographs is already a respectable party trick.

But the more important story is the performance ladder:

Evaluation setting	Likely purpose	Reported outcome	What it means
Held-out synthetic NMDID cases	Main evidence	68.5% acceptable trajectories	The imitation policy can learn key elements of X-ray-guided alignment in a controlled simulated environment.
Synthetic fractured anatomy	Robustness / generalisation test	49.2% acceptable trajectories	The policy degrades sharply when anatomy becomes less regular and corridors narrow.
Real bi-planar X-rays	Preliminary sim-to-real test	40.9% acceptable by clinical specialist review	Partial transfer is possible, but the sample is too small and too manually processed to imply clinical readiness.

That ladder is the paper’s real contribution. Not “autonomous spine surgery has arrived”, because no, it has not. More accurately: the authors built a serious in silico benchmark for CT-free X-ray-guided surgical policy learning, then showed exactly where the dream starts scraping against anatomy, imaging physics, and safety. Reality, as usual, has not signed the demo waiver.

The evidence matters more than the robot fantasy

The paper investigates vertebroplasty, a procedure where a cannula must be guided through the pedicle into the vertebral body. The geometry is unforgiving. The target corridor is narrow, the anatomy varies across vertebral levels, and the visual input is a pair of low-context 2D projections of a 3D structure.

Traditional navigation approaches often lean on CT, preoperative planning, registration, or specialised intra-operative systems. Those systems can improve geometric understanding, but they bring cost, workflow overhead, equipment demands, and integration friction. The paper’s question is therefore commercially interesting: could a learned system use the fluoroscopy-style views already present in many procedures and still propose useful tool trajectories?

The answer is “partly”, which is not as glamorous as “yes”, but much more useful.

The authors created a controlled simulation pipeline. They started from CT scans, segmented vertebrae using TotalSegmentator, generated vertebral meshes, propagated safe trajectories using statistical shape models, filtered unsafe paths, and rendered bi-planar X-ray sequences using DeepDRR. The result was a dataset of 7,390 simulated episodes across 252 patients, each with navigation, orientation, and insertion phases.

This matters because real surgical trajectory labels are expensive, scarce, and unpleasantly resistant to scale. Simulation is the way around the data bottleneck. The catch, naturally, is that simulation must be realistic enough for the policy to learn something transferable rather than merely becoming a world-class player of Synthetic Spine Simulator 2026.

What the policy actually learns

The model is built around an Action Chunking with Transformers (ACT) style imitation learning policy. At each timestep, it sees four radiographic inputs: the AP and lateral images, plus cropped versions of each around the target vertebra. From these, it predicts incremental action chunks for the cannula: translation, rotation, insertion depth, phase flags, and pedicle side.
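To make that input/output contract concrete, here is a minimal sketch of the four image inputs and the per-step action fields as plain Python types. Every name, field, and unit below is an assumption made for illustration; it is not the paper's tensor layout or code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Sequence, Tuple

# Placeholder type for a 2D grayscale radiograph (hypothetical).
Image = Sequence[Sequence[float]]
Vec3 = Tuple[float, float, float]


class Phase(Enum):
    NAVIGATION = 0
    ORIENTATION = 1
    INSERTION = 2


class PedicleSide(Enum):
    LEFT = 0
    RIGHT = 1


@dataclass(frozen=True)
class Observation:
    ap: Image        # full anteroposterior view
    lat: Image       # full lateral view
    ap_crop: Image   # AP crop around the target vertebra
    lat_crop: Image  # lateral crop around the target vertebra


@dataclass(frozen=True)
class Action:
    translation_mm: Vec3  # incremental translation (assumed units)
    rotation_deg: Vec3    # incremental rotation (assumed units)
    insertion_mm: float   # incremental advance along the tool axis
    phase: Phase          # phase flag predicted with the motion
    side: PedicleSide     # which pedicle is being targeted


# One policy call returns a short sequence of actions, not a single step.
ActionChunk = List[Action]


def total_insertion(chunk: ActionChunk) -> float:
    """Sum the incremental insertion depth across one predicted chunk."""
    return sum(action.insertion_mm for action in chunk)
```

The point of the sketch is the shape of the problem: four images in, a short list of small pose increments out, with phase and side carried alongside the motion.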

This is not a model that simply predicts a final target pose. It learns a sequence. That distinction matters.

A final-pose predictor says, “Put the tool there.” A trajectory policy says, “Move like this, adjust like this, then insert like this.” In principle, the latter is closer to the way procedural work actually unfolds. Surgeons do not teleport instruments into ideal configurations; they move, check, align, correct, and only then commit.

The policy’s training process tries to mimic that structure. Each episode is broken into navigation, orientation, and insertion. The robot policy observes the simulated bi-planar radiographs and predicts fine-grained pose changes over time. During evaluation, the policy rolls forward iteratively, with simulated X-rays updated at each step.

That iterative design is both the advantage and the trap. It allows the model to behave like a planning policy rather than a static estimator. But early errors can compound. If the entry point is slightly off before insertion begins, later orientation may still look reasonable while the feasible anatomical corridor has already narrowed. In spine procedures, a few millimetres are not a rounding error. They are the plot.
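The loop structure can be sketched in a few lines. This is a toy, not the paper's evaluation code: the pose is a bare list of numbers, the "policy" is a hypothetical proportional controller, and the X-ray rendering step is replaced by reading the pose directly. The one thing it does illustrate is that a fixed perception bias upstream survives every later correction.

```python
from typing import Callable, List

Pose = List[float]         # hypothetical pose vector, e.g. [x, y, z] in mm
Chunk = List[List[float]]  # a chunk is a short list of pose increments


def rollout(policy: Callable[[Pose], Chunk], start: Pose, max_steps: int) -> List[Pose]:
    """Roll a chunked policy forward, re-observing after each executed chunk."""
    pose = list(start)
    history = [list(pose)]
    steps = 0
    while steps < max_steps:
        # In the paper's setting the policy would see re-rendered X-rays here.
        chunk = policy(pose)
        if not chunk:
            break
        for delta in chunk:
            if steps >= max_steps:
                break
            pose = [p + d for p, d in zip(pose, delta)]
            history.append(list(pose))
            steps += 1
    return history


def make_toy_policy(target: Pose, bias: Pose, chunk_size: int = 2, gain: float = 0.5):
    """Toy policy that steers toward a *misperceived* target (target + bias)."""
    perceived = [t + b for t, b in zip(target, bias)]

    def policy(pose: Pose) -> Chunk:
        chunk, current = [], list(pose)
        for _ in range(chunk_size):
            delta = [gain * (g - c) for g, c in zip(perceived, current)]
            chunk.append(delta)
            current = [c + d for c, d in zip(current, delta)]
        return chunk

    return policy


# A 1 mm perception bias on x: the rollout converges, but to the wrong place.
trajectory = rollout(make_toy_policy([0.0, 0.0, 0.0], [1.0, 0.0, 0.0]),
                     start=[10.0, 10.0, 10.0], max_steps=20)
final_pose = trajectory[-1]
```

In this toy the closed loop happily removes the initial 10 mm offset and then settles about 1 mm off on x, because nothing in the loop can see the bias. That is the flavour of the entry-point problem discussed below, minus the anatomy.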

The first result is promising, but the failure rate is the business signal

On held-out synthetic cases, the policy achieved 68.5% acceptable trajectories, where acceptable means Grade A or B under a modified Gertzbein–Robbins-style breach grading scheme. Grade A means no breach; Grade B means a breach of up to 2 mm.

The breakdown is revealing:

Grade	Interpretation	Share
A	No breach	52.4%
B	Minor breach, ≤2 mm	16.1%
C	2–4 mm breach	6.9%
D	4–6 mm breach	0.3%
E	≥6 mm breach or extra-pedicular	24.4%

The good news: more than half of predicted trajectories had no breach, and the model often maintained intra-pedicular containment. The bad news: nearly a quarter were Grade E. That is not a deployable autonomy figure; it is a feasibility benchmark with a flashing red arrow labelled “entry point”.
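The grading rule is simple enough to write down. The sketch below maps a breach distance to a grade using the thresholds in the table above; how the exact 4 mm boundary is assigned is an assumption (the table only says "2–4" and "4–6"), and the function names are illustrative rather than the authors'.

```python
from collections import Counter
from typing import Iterable


def grade(breach_mm: float, extra_pedicular: bool = False) -> str:
    """Map a breach distance in mm to a grade A-E (thresholds from the table)."""
    if extra_pedicular or breach_mm >= 6.0:
        return "E"
    if breach_mm <= 0.0:
        return "A"  # no breach
    if breach_mm <= 2.0:
        return "B"  # minor breach, up to and including 2 mm
    if breach_mm <= 4.0:
        return "C"  # assumption: exactly 4 mm is counted as C
    return "D"


def acceptance_rate(grades: Iterable[str]) -> float:
    """Fraction of trajectories graded A or B."""
    counts = Counter(grades)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no grades supplied")
    return (counts["A"] + counts["B"]) / total


# The reported shares are consistent with the headline figure:
reported_acceptance = round(52.4 + 16.1, 1)
```

Summing the A and B shares, 52.4% + 16.1%, recovers the 68.5% headline, so the "acceptable" figure and the grade table are telling the same story.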

The geometric analysis makes the issue sharper. For successful cases that entered the pedicle or vertebra, the mean entry-point error was 5.46 ± 3.67 mm, while the mean angular offset was 3.53 ± 2.19°. Prior automatic trajectory work based on statistical shape models (SSMs), cited by the authors, reported roughly 2.5 mm entry-point deviation and 3.5° angular difference. So the policy’s angular estimation is broadly competitive, but its entry localisation is materially weaker.

That asymmetry is the centre of the paper. The model is not simply “bad at surgery”. It is specifically worse at putting the first contact point in the right place than at estimating the general direction of travel. In a narrow pedicle corridor, that difference is the difference between a plausible plan and a breach.
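A back-of-envelope calculation shows why. Under a deliberately simple planar model, the tip's sideways deviation at a given depth is the entry offset plus the depth times the tangent of the angular error. The 40 mm depth below is an illustrative assumption, not a figure from the paper, and adding the two terms is a worst case in which both errors push the same way.

```python
import math


def lateral_deviation_mm(entry_error_mm: float, angular_error_deg: float,
                         depth_mm: float) -> float:
    """Worst-case in-plane tip deviation at a given depth along the path.

    Assumes a straight tool and that the entry offset and the angular
    drift displace the tip in the same direction.
    """
    return entry_error_mm + depth_mm * math.tan(math.radians(angular_error_deg))


ASSUMED_DEPTH_MM = 40.0  # illustrative depth, not taken from the paper

angular_only = lateral_deviation_mm(0.0, 3.53, ASSUMED_DEPTH_MM)
entry_only = lateral_deviation_mm(5.46, 0.0, ASSUMED_DEPTH_MM)
combined = lateral_deviation_mm(5.46, 3.53, ASSUMED_DEPTH_MM)
```

With the reported mean errors, the angular term contributes roughly 2.5 mm at that depth, while the entry term contributes 5.46 mm before the tool has moved at all. Under these assumptions the entry error is the larger share of the budget, which matches the paper's emphasis.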

Fractured anatomy turns difficulty into brittleness

The fractured-anatomy test is best read as a robustness and generalisation test, not as a second main thesis.

The authors evaluated the policy on 63 unseen CTs containing vertebral fractures, producing 962 additional episodes. Acceptance fell to 49.2%, a 19.3 percentage-point drop from the 68.5% achieved on the synthetic held-out cases.

This is not surprising, but it is informative. Fractures distort the regular geometry that the model has learned to exploit. They reduce the valid corridor and make sparse radiographic reasoning even more ambiguous. Performance degradation was especially pronounced in upper thoracic levels, where vertebrae are smaller and anatomical variation leaves less margin for error.

There is also an evaluation caveat: the fractured dataset had an imbalanced vertebral distribution, with fewer upper thoracic cases. That weakens level-specific conclusions. Still, the direction is hard to ignore. When anatomy becomes less textbook, the learned policy becomes less reliable.

For business readers, this is the difference between “works on well-behaved cases” and “works in the messy case mix that hospitals actually bill for”. Many medical AI systems look better when the anatomy behaves politely. The anatomy, sadly, has no investor relations department.

The ablations explain the model’s dependency on visual context

The ablation table is not decorative. It tells us what the policy needs in order to function.

Configuration	Likely purpose	Angular offset	Entry-point distance	Acceptable
Baseline: AP + LAT + crops	Main configuration	3.53 ± 2.19°	5.46 ± 3.67 mm	68.5%
Centre initialisation	Sensitivity to start pose	6.32 ± 4.21°	6.76 ± 4.70 mm	41.0%
AP-only	Input ablation	5.52 ± 4.31°	8.84 ± 5.16 mm	17.6%
LAT-only	Input ablation	—	—	0.0%
Without cropped views	Input-detail ablation	5.09 ± 3.78°	7.63 ± 4.71 mm	42.2%

The obvious lesson is that both views matter. AP-only collapses to 17.6%; LAT-only cannot execute insertions into the vertebra at all. Removing cropped guidance also hurts badly, dropping acceptance to 42.2%.

The less obvious lesson is that initialisation matters in a slightly uncomfortable way. Starting from a fixed central pose above the vertebra reduced acceptance to 41.0%. The paper notes that this increased the likelihood of central foramen breaches. In other words, a seemingly cleaner starting condition was not necessarily easier. The model had learned a behaviour distribution under randomised starts; changing that distribution altered the failure pattern.
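Putting the ablations on one scale helps. The snippet below takes the acceptance figures from the table and expresses each as a percentage-point drop from the baseline; the variable names are illustrative, and the numbers are simply those reported above.

```python
BASELINE_ACCEPTANCE = 68.5  # AP + LAT + crops, from the ablation table

ablation_acceptance = {
    "Centre initialisation": 41.0,
    "AP-only": 17.6,
    "LAT-only": 0.0,
    "Without cropped views": 42.2,
}


def drop_pp(baseline: float, value: float) -> float:
    """Percentage-point drop relative to the baseline, rounded to one decimal."""
    return round(baseline - value, 1)


drops = {name: drop_pp(BASELINE_ACCEPTANCE, value)
         for name, value in ablation_acceptance.items()}

# Most damaging ablation first.
ranked = sorted(drops, key=drops.get, reverse=True)
```

Read this way, removing a whole view costs 50.9 to 68.5 points, while removing the crops or fixing the start pose costs 26.3 and 27.5 points respectively. The two "softer" changes are about as expensive as each other, which is the uncomfortable part.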

This matters operationally because surgical robotics products do not live inside tidy benchmark assumptions. If a policy is sensitive to starting pose, view alignment, crop placement, and phase timing, then the commercial system must either control those conditions tightly or detect when they are outside tolerance. Otherwise, “AI navigation” becomes a very expensive way to discover that geometry still exists.

The real-X-ray test is encouraging, small, and not enough

The real-X-ray evaluation is the most marketable part of the paper and the easiest to overread.

The authors tested on two bi-planar X-ray pairs from the BUU-LSPINE dataset. After manual alignment and cropping around four lumbar vertebrae, they targeted left and right pedicles with five rollouts each, producing 80 simulated insertions (2 pairs × 4 vertebrae × 2 pedicles × 5 rollouts). The policy used one AP and one lateral acquisition, with simulated cannula motion overlaid onto the real radiographs.

Fourteen rollouts failed to initiate insertion and were excluded. The remaining 66 attempts were graded manually in two rounds: first by two researchers with vertebroplasty domain knowledge, then by an independent clinical specialist.

The results:

Reviewer	Clinically acceptable	Borderline	Acceptance
Researchers	9	14	34.8%
Clinical specialist	21	6	40.9%

This does show partial sim-to-real transfer. The model trained in simulation can sometimes generate plausible trajectories on real radiographs. That is the important achievement.

But the boundary is firm. The real-X-ray sample is tiny. The evaluation lacks CT ground truth. The images require manual alignment and cropping. The grading is based on bi-planar views rather than full 3D verification. And 14 of 80 rollouts failed to initiate insertion before grading.
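The counts are small enough to check by hand, and doing so shows how much the denominator matters. In the sketch below, the reported acceptance figures are reproduced by counting acceptable plus borderline cases over the 66 graded rollouts. The second rate, which treats the 14 rollouts that never initiated insertion as failures, is a derived illustration and not a figure reported in the paper.

```python
def rate(numerator: int, denominator: int) -> float:
    """Return a percentage (unrounded)."""
    return 100.0 * numerator / denominator


pairs, vertebrae, sides, rollouts_each = 2, 4, 2, 5
attempted = pairs * vertebrae * sides * rollouts_each  # 80 simulated insertions
not_initiated = 14
graded = attempted - not_initiated                     # 66 graded rollouts

# (clinically acceptable, borderline) counts from the table above
reviews = {"Researchers": (9, 14), "Clinical specialist": (21, 6)}

summary = {}
for reviewer, (acceptable, borderline) in reviews.items():
    passed = acceptable + borderline
    summary[reviewer] = {
        "of_graded": rate(passed, graded),        # matches the reported column
        "of_attempted": rate(passed, attempted),  # derived, not reported
    }

excluded_share = rate(not_initiated, attempted)
```

Over the graded 66, the specialist's figure is 40.9%. Over all 80 attempts it would be roughly 34%, and 17.5% of attempts never reached grading. Neither denominator is wrong; they answer different questions, and a product team would care about the second one.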

The paper also identifies three dominant real-X-ray failure modes: pedicle corridor breach from entry-point deviation, failure to initiate insertion due to poor alignment, and post-insertion pose updates. The last one is particularly telling. Post-insertion updates appeared in 15.2% of cases and could turn initially acceptable trajectories into failures by causing unrealistic over-advancement or loss of containment.

That is not just a model-performance issue. It is a control-policy governance issue. Once insertion begins, the system needs phase constraints. A robot should not keep “improving” the plan after the act has become physically committed. Optimisation after commitment is a lovely metaphor for bad management, but a dangerous property in surgery.
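One way to picture such a constraint is a small gate that sits between the policy and the robot. The sketch below is a hypothetical design, not something the paper implements: phases can only move forward, no insertion is allowed before the insertion phase, and once insertion starts the gate discards lateral translation and rotation and caps total depth. The class name, fields, and depth limit are all assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Tuple

Vec3 = Tuple[float, float, float]
ZERO: Vec3 = (0.0, 0.0, 0.0)


class Phase(IntEnum):
    NAVIGATION = 0
    ORIENTATION = 1
    INSERTION = 2


@dataclass(frozen=True)
class Command:
    translation_mm: Vec3
    rotation_deg: Vec3
    insertion_mm: float
    phase: Phase


class PhaseGate:
    """Hypothetical safety filter applied to each policy command."""

    def __init__(self, max_depth_mm: float) -> None:
        self.max_depth_mm = max_depth_mm
        self.phase = Phase.NAVIGATION
        self.depth_mm = 0.0

    def filter(self, cmd: Command) -> Command:
        # Phases are monotonic: the policy cannot step back out of insertion.
        self.phase = max(self.phase, cmd.phase)
        if self.phase != Phase.INSERTION:
            # Before commitment: free alignment, but no advance into bone.
            return Command(cmd.translation_mm, cmd.rotation_deg, 0.0, self.phase)
        # After commitment: pose is locked, only capped forward advance remains.
        remaining = self.max_depth_mm - self.depth_mm
        advance = min(max(cmd.insertion_mm, 0.0), remaining)
        self.depth_mm += advance
        return Command(ZERO, ZERO, advance, self.phase)
```

A gate like this would not make a bad entry point good. It would only stop a trajectory that was acceptable at commitment from being "improved" into a breach afterwards, which is the specific failure described above.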

The business value is workflow-light guidance, not autonomous replacement

The most plausible business pathway is not “replace the surgeon”. It is not even “fully autonomous vertebroplasty”. The paper itself is more sensible than that.

The realistic value proposition is a lightweight planning and alignment layer that uses fluoroscopy streams already present in the operating room. If such systems mature, they could reduce dependence on CT-based navigation infrastructure, lower setup friction, and help surgeons align trajectories with less workflow disruption.

That is a device-maker and hospital-operations story, not a science-fiction story.

What the paper directly shows	Cognaptus business inference	What remains uncertain
ACT-style imitation policies can learn some X-ray-guided cannula alignment behaviour in simulation.	Simulation can reduce the data bottleneck for surgical AI development.	Whether simulated demonstrations can cover enough real anatomical and imaging variation.
Acceptance reaches 68.5% in held-out synthetic cases.	There is a credible technical basis for assistive trajectory planning research.	The rate is not clinically sufficient for autonomy.
Performance drops to 49.2% on fractured anatomy.	Edge-case anatomy will drive product risk and validation cost.	How much stronger priors, better imaging simulation, or real demonstrations improve robustness.
Real-X-ray rollouts reach 40.9% specialist-reviewed acceptance.	Sim-to-real transfer may be feasible without retraining in some constrained cases.	The sample is too small and manually processed to support deployment claims.
Entry-point localisation and phase control dominate failures.	Product value may come from hybrid human-AI workflow: human selects/approves entry, AI proposes alignment.	Whether users will trust, correct, and integrate such assistance in live procedures.

For hospitals, the operational appeal would be reduced infrastructure burden. CT-based systems can require specialised hardware, registration workflows, staff training, and procedural adjustments. A system that works with AP/LAT fluoroscopic views could fit more naturally into existing practice.

For surgical robotics companies, the appeal is modular. A policy like this could become a planning assistant, trajectory sanity-checker, or intra-operative alignment recommender before it becomes anything resembling autonomy. That is probably the correct adoption sequence. In high-risk medicine, the road to autonomy runs through decision support, supervised execution, and boring validation. Very boring. Which is precisely why it matters.

The technical bottleneck is not “more AI”; it is better geometry under sparse evidence

The paper is useful because it refuses to let transformer enthusiasm do all the work.

ACT-style policies have performed well in video-based robotics, where the model can exploit rich visual texture, multiple camera views, and dense temporal feedback. X-ray imaging does not offer that luxury. It gives low-contrast projections, limited anatomical context, and ambiguity about depth. The model must infer 3D structure from sparse 2D evidence.

That difference breaks some assumptions behind imitation learning. A robot arm manipulating visible objects on a table sees surfaces, edges, occlusion cues, and object motion. A cannula navigating a vertebral pedicle sees shadows of bone through overlapping anatomy. The input is not merely noisy. It is underdetermined.

This is why the paper’s failure modes point toward stronger anatomical priors, domain-adaptive training, phase constraints, and possibly richer encoders. The model needs more than behavioural cloning from synthetic trajectories. It needs a way to reason about permissible anatomy, not just imitate movements that were safe in generated demonstrations.

The authors also trained without domain randomisation. That choice helps isolate the feasibility question, but it leaves sim-to-real robustness underdeveloped. Future systems will need to vary imaging conditions, projection geometry, patient pose, anatomy, tool visibility, and noise. Otherwise, the model learns one universe and is then asked to operate in another. Medical devices are rarely cleared for use in alternate universes.
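For readers unfamiliar with the term, domain randomisation just means sampling the nuisance parameters afresh for every simulated episode. The sketch below shows the idea with Python's standard library only. The parameter names and ranges are placeholders invented for illustration; they are not values from the paper and do not correspond to any real renderer's API.

```python
import random
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RenderParams:
    ap_tilt_deg: float       # deviation of the AP view from its nominal angle
    lat_tilt_deg: float      # deviation of the lateral view from its nominal angle
    patient_shift_mm: float  # patient offset relative to the imaging centre
    noise_sigma: float       # image noise level (arbitrary units)
    tool_opacity: float      # how strongly the tool shows up in the image


# Placeholder ranges, chosen only to make the example run.
RANGES = {
    "ap_tilt_deg": (-10.0, 10.0),
    "lat_tilt_deg": (-10.0, 10.0),
    "patient_shift_mm": (-20.0, 20.0),
    "noise_sigma": (0.0, 0.05),
    "tool_opacity": (0.6, 1.0),
}


def sample_render_params(rng: random.Random) -> RenderParams:
    """Draw one set of rendering parameters uniformly from the ranges."""
    return RenderParams(**{name: rng.uniform(low, high)
                           for name, (low, high) in RANGES.items()})


# One fresh draw per simulated episode; a seeded generator keeps it reproducible.
rng = random.Random(0)
episode_params = [sample_render_params(rng) for _ in range(3)]
```

Uniform sampling is the bluntest possible choice. The harder question, which the sketch does not touch, is which ranges reflect real fluoroscopy practice.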

Boundaries that materially change interpretation

Several limitations are not footnotes; they define what the result means.

First, the strongest evidence is synthetic. The main performance number, 68.5%, comes from a controlled simulated environment built from CT-derived anatomy and DeepDRR-rendered radiographs. That environment is carefully designed, but it is still not live surgery.

Second, real-X-ray evaluation is preliminary. It uses two AP/LAT pairs, manual alignment and cropping, no CT ground truth, and manual grading from bi-planar views. The result supports feasibility, not readiness.

Third, entry-point error is the central failure mode. The policy’s angular estimates are relatively good, but entry-point localisation remains weak. For transpedicular access, entry is not a minor detail. It is the gatekeeper of safety.

Fourth, the policy needs phase discipline. Post-insertion updates can convert plausible trajectories into failures. A practical system would need explicit constraints that lock or supervise behaviour once insertion begins.

Fifth, the radiation workflow is unresolved. Iterative X-ray-guided planning can imply frequent imaging updates. The real-X-ray experiment suggests full trajectories can sometimes be proposed from two initial views, with the simulated tool overlaid on those images, but translating that into a safe intra-operative workflow remains open.

These are not reasons to dismiss the paper. They are reasons to value it. A useful feasibility study does not merely produce a good number; it tells the field which number is still bad.

The more honest future is assistive, supervised, and anatomy-aware

The misconception to avoid is simple: this paper does not demonstrate a deployable autonomous spine robot. It demonstrates that learned policies can recover some surgeon-like alignment behaviour from sparse radiographs, and that the remaining gap is specific enough to be attacked.

That specificity is the contribution.

The path forward is likely hybrid. Human clinicians may remain responsible for entry-point selection and insertion start. The AI system may propose candidate trajectories, simulate likely tool motion, flag risky corridors, or provide alignment guidance under supervision. Corrections from surgeons could become additional demonstrations, gradually improving the policy in settings where real-world data can be collected ethically and safely.

The deeper lesson extends beyond vertebroplasty. Many medical AI opportunities sit exactly here: not in replacing expert judgement, but in translating sparse, messy clinical signals into structured assistance. The paper shows that imitation learning can enter this territory, but also that it cannot simply import assumptions from consumer robotics and expect anatomy to cooperate.

X-rays are shadows. The robot can learn from them. It just does not yet understand them well enough to be left alone in the dark.

Sources

Cognaptus: Automate the Present, Incubate the Future.

¹ Florence Klitzner, Blanca Inigo Romillo, Benjamin D. Killeen, Lalithkumar Seenivasan, Michelle Song, Rebecca Choi, Majid Khan, Axel Krieger, and Mathias Unberath, “Investigating Robot Control Policy Learning for Autonomous X-ray-guided Spine Procedures,” arXiv:2511.03882v2, 2026. https://arxiv.org/abs/2511.03882
