Safe Hands, Unsafe Audit: Why Robot Success Does Not Prove Robot Safety

A robot finishes the task. It picks, places, inserts, wipes, stacks, or assembles. The demo video looks clean. The benchmark reports success. Everyone exhales.

This is exactly where the safety argument should begin, not end.

The awkward truth about embodied AI is that a robot can complete a task while accumulating risk along the way. It may interpret the instruction too narrowly, skip an implicit prerequisite, recover from a mistake in a physically unstable way, apply too much force, or pass through a near miss that the final success metric politely declines to remember. The task is done. The audit trail is missing. Convenient, in the same way a black box with wheels is convenient.

The arXiv survey “Safe Embodied AI for Long-horizon Tasks: A Cross-layer Analysis of Robotic Manipulation” gives this problem a useful structure.¹ It does not introduce a new robot policy, a new benchmark leaderboard, or a triumphant “our method beats yours” table. Its contribution is more operationally important: it maps where safety interventions enter a long-horizon manipulation system, what kind of hazard each intervention addresses, and what kind of evidence each safety claim actually supports.

That sounds bureaucratic. It is not. It is the difference between saying “the robot succeeded” and saying “the robot remained within justified safety assumptions while succeeding.” The first sentence belongs in a demo reel. The second belongs in a deployment file.

Long-horizon manipulation is where small mistakes learn patience

The paper chooses long-horizon robotic manipulation as its anchor domain because it compresses several safety problems into one uncomfortable package. A robot does not merely classify an image or answer a prompt. It reasons over instructions, interprets a scene, decomposes a task, selects actions, moves hardware, touches objects, and sometimes operates near humans. Each stage can be locally plausible while still making the later rollout less safe.

That is the key mechanism. Safety risk is not always visible at the moment it is created.

A misgrounded instruction may look harmless during planning. A skipped precondition may remain invisible until the robot reaches a later subtask. A policy may appear fluent while drifting away from the actual procedural state. A motion may be geometrically correct but physically too rigid under contact. In long-horizon manipulation, errors compound across time and across abstraction layers. Safety becomes less like a checkbox and more like bookkeeping with consequences.

The survey’s first useful move is to separate four kinds of safety that are often lazily merged:

Safety type	What it concerns	Typical hidden failure
Semantic safety	Whether the robot understood the task, objects, affordances, and constraints	The robot follows the wrong interpretation of a plausible instruction
Procedural safety	Whether subtasks, preconditions, and recoverable states are preserved	The robot completes step 3 after silently invalidating step 2
Operational safety	Whether continued autonomy is still justified	The robot should ask, halt, hand off, or replan, but continues anyway
Physical safety	Whether contact, force, motion, and interaction remain within safe bounds	The robot completes the task while overloading, jamming, slipping, or damaging

This categorization matters because benchmark success mostly sees the endpoint. Safety, especially in physical systems, lives in the path.

The paper’s figures should be read accordingly. Figure 1 is a motivation diagram, not experimental evidence: it illustrates how hidden risk can accumulate across grounding, action generation, and contact-rich execution. Figures 2 through 6 are roadmap figures: they organize the safety lifecycle into planning-time, policy-time, execution-time, evaluation, and future directions. The tables are also classification tools rather than empirical result tables. Table 1 positions manipulation against navigation and locomotion as a denser safety testbed. Table 2 positions the survey against adjacent surveys. Table 3 defines evidence categories. Tables 4 and 5 organize mechanism families. Table 6 summarizes benchmark evidence. None of these should be mistaken for an ablation study. There is no “variant A beats variant B” experiment hiding in the appendix. The paper’s purpose is taxonomy and evidence discipline.

That is already a contribution, because robotics safety discussions often mix mechanism, evidence, and aspiration in the same sentence. Very efficient, if the goal is confusion.

The first category: planning-time safety catches bad tasks before they become bad motion

Planning-time safety is where the robot decides what task it is actually being asked to perform. The survey breaks this layer into goal grounding, constraint interpretation, subtask structuring, formal specification, spatial grounding, and task-and-motion planning support.

This layer answers questions such as:

Did the robot ground the instruction in the correct objects and initial state?
Did it turn natural-language restrictions into explicit constraints?
Did it decompose the task into a valid sequence of subtasks?
Did it preserve temporal and spatial requirements?
Did it check whether the abstract plan can become executable motion?

The business version is simpler: planning-time safety is the procurement and compliance layer of a robot’s mind. Before anything moves, the system must know what it is allowed to do, what it is not allowed to do, and whether the proposed task has a plausible path to execution.

The paper is careful about the evidence boundary here. Planning-time methods can provide formal or structured checks: temporal logic specifications, precondition/postcondition verification, symbolic planning, constraint satisfaction, task-and-motion planning, and spatial grounding. These tools are valuable because they can reject unsafe or infeasible plans before hardware starts moving.

But a verified plan is not a safe rollout.

This is the category error the paper keeps correcting. A plan can be logically consistent under its encoded predicates while still relying on incomplete perception, missing geometric detail, stale world states, or unmodeled contact assumptions. A temporal logic formula may be syntactically valid but semantically wrong. A task sequence may satisfy all visible preconditions while ignoring that a cabinet is jammed, an object is too heavy, or a grasp is unstable. A spatially grounded plan may identify the correct object part and still fail under clutter, robot-specific kinematics, or force dynamics.

Planning-time safety therefore provides upstream assurance. It is not a certificate for downstream physical behavior.
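To make the evidence boundary concrete, here is a minimal sketch of a precondition/postcondition check over a symbolic plan. It is an illustration, not the survey's method: the `Step` class, the `check_plan` function, and every predicate name are hypothetical. The checker rejects a plan that skips a prerequisite, but it can only reason about predicates someone encoded.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Step:
    """One symbolic subtask with STRIPS-style conditions (hypothetical schema)."""
    name: str
    preconditions: frozenset
    adds: frozenset = field(default_factory=frozenset)
    deletes: frozenset = field(default_factory=frozenset)


def check_plan(initial_state, steps):
    """Return (ok, message). Only the encoded predicates are checked."""
    state = set(initial_state)
    for index, step in enumerate(steps, start=1):
        missing = step.preconditions - state
        if missing:
            return False, f"step {index} ({step.name}) is missing {sorted(missing)}"
        # Apply the step's declared effects to the symbolic state.
        state -= step.deletes
        state |= step.adds
    return True, "all encoded preconditions hold"


plan = [
    Step("open_cabinet", frozenset({"cabinet_closed", "gripper_empty"}),
         adds=frozenset({"cabinet_open"}), deletes=frozenset({"cabinet_closed"})),
    Step("pick_cup", frozenset({"cabinet_open", "gripper_empty"}),
         adds=frozenset({"holding_cup"}), deletes=frozenset({"gripper_empty"})),
]
initial = {"cabinet_closed", "gripper_empty"}

ok, message = check_plan(initial, plan)
skipped_ok, skipped_message = check_plan(initial, plan[1:])  # prerequisite skipped

print(ok, message)
print(skipped_ok, skipped_message)
```

The first plan passes and the second is rejected before anything moves. Note what the sketch cannot see: there is no predicate for a jammed cabinet or an unstable grasp, so the "verified" plan stays verified even if the hinge is stuck.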

This distinction is especially important for foundation-model robotics, where natural language creates a seductive interface. If the robot can explain its plan, humans may assume the plan is safe. But explanation is cheap. Constraint preservation is expensive. The gap between the two is where many future incident reports will be written.

The second category: policy-time safety shapes what the robot is likely to propose next

Policy-time safety sits between abstract planning and physical execution. It concerns the policy’s action-generation process: how candidate actions are proposed, restricted, aligned, or shaped before they are committed to the environment.

The survey usefully treats policy generation as a constrained decision process, even if not every reviewed method literally fits the same mathematical template. The core idea is that a policy has a class of possible outputs, a set of constraints, and an objective it has been trained or steered to optimize. Safety can intervene at any of those points.

This gives us three practical subcategories.

First, the policy interface matters. A policy may output continuous controls, action tokens, skill calls, executable programs, latent actions, or diffusion trajectories. Each interface creates different safety handles. Discrete tokens can be masked. Program lines can be checked against APIs. Continuous trajectories can be filtered, projected, or optimized. Latent representations may improve flexibility while hiding safety-relevant variables inside the model’s internal fog machine.

Second, constraints can be injected before action commitment. The survey covers temporal-logic constraints, action masking, safety layers, constrained decoding, safe reinforcement learning, and certificate-guided methods. These mechanisms can block unsafe candidates under specified assumptions. The stronger versions borrow from control theory, reachability, barrier functions, and constrained optimization.

Third, objectives can be shaped. Preference learning, language-guided rewards, video-language critics, intervention-derived rewards, and stage-aware reward models can push policies away from unsafe or human-dispreferred behavior. For long-horizon manipulation, this matters because final success rewards often hide procedural mistakes. A robot may be rewarded for reaching the endpoint while learning very little about how not to make the journey stupid.
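The action-masking handle from the second subcategory can be sketched in a few lines. This is an assumption-laden toy, not any reviewed method: the function `mask_and_choose`, the action names, and the scores are all hypothetical, and the mask is only as good as the `allowed` set someone specified.

```python
def mask_and_choose(scores, allowed):
    """Return the best-scoring admissible action, or None if every action is masked."""
    admissible = {action: s for action, s in scores.items() if action in allowed}
    if not admissible:
        return None  # nothing admissible: the caller should halt or hand off
    return max(admissible, key=admissible.get)


# Hypothetical policy scores for the next skill call.
scores = {"push_hard": 2.1, "push_gently": 1.4, "retreat": 0.3}

choice = mask_and_choose(scores, allowed={"push_gently", "retreat"})
nothing_allowed = mask_and_choose(scores, allowed=set())

print(choice)           # push_gently
print(nothing_allowed)  # None
```

The policy preferred `push_hard`; the mask removed it before commitment. Returning `None` instead of silently picking something is a deliberate choice here: an empty admissible set is a signal for the operational layer, not a tie to break.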

Here again the paper’s evidence discipline is the point. Policy-time mechanisms can reduce unsafe proposals, lower cumulative safety costs, improve alignment with feedback, or make progress more inspectable. But much of this evidence is empirical and distribution-dependent. Preference alignment does not exclude unsafe candidates by construction. Safe reinforcement learning may reduce expected violation costs while still permitting transient state-wise violations. Learned certificates and latent safety filters can break under distribution shift. Stage-aware reward modeling may reduce procedural drift on evaluated tasks without proving the policy has captured all hidden prerequisites.

So policy-time safety is not a moral education program for robots. It is a set of interfaces, constraints, and shaped objectives that make unsafe behavior less likely under specific assumptions. The assumptions matter. Annoyingly, they usually matter more than the marketing deck.

The third category: execution-time safety is where assumptions meet furniture

Execution-time safety begins when the robot is already acting in the world. At this point, the system is no longer dealing with ideal plans or clean representations. It faces sensor noise, occlusion, distribution shift, unexpected object motion, contact uncertainty, and the special comedy of real-world hardware.

The survey divides execution-time safety into three large families: runtime risk assessment, runtime adaptation and task restoration, and contact-rich physical interaction safety.

Runtime risk assessment is the early-warning layer. It includes anomaly detection, multimodal monitoring, uncertainty scoring, spatio-temporal reasoning, semantic misalignment detection, and failure diagnosis. The key question is not merely whether the robot failed, but whether it can detect a dangerous deviation early enough to prevent escalation.

That timing issue is crucial. A failure detector that recognizes disaster after the object breaks is not a safety mechanism. It is a historian.
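The historian problem can be stated as arithmetic: an alarm helps only if it fires at least as many steps before the failure as the system needs to react. The sketch below assumes a per-step risk score and a known failure step from a logged rollout; the function names, the threshold values, and the trace are hypothetical.

```python
def first_alarm_step(risk_scores, threshold):
    """Index of the first step whose risk score reaches the threshold, else None."""
    for step, score in enumerate(risk_scores):
        if score >= threshold:
            return step
    return None


def alarm_is_actionable(risk_scores, threshold, failure_step, reaction_steps):
    """True if the alarm leaves at least `reaction_steps` before the failure."""
    alarm = first_alarm_step(risk_scores, threshold)
    if alarm is None:
        return False  # the detector never fired
    return failure_step - alarm >= reaction_steps


# Hypothetical logged rollout: the object breaks at step 5,
# and the system needs 2 steps to stop or soften.
risk = [0.1, 0.2, 0.4, 0.7, 0.9, 0.95]

early = alarm_is_actionable(risk, threshold=0.6, failure_step=5, reaction_steps=2)
late = alarm_is_actionable(risk, threshold=0.9, failure_step=5, reaction_steps=2)

print(early)  # True: alarm at step 3, two steps of lead time
print(late)   # False: alarm at step 4, one step of lead time
```

Both thresholds "detect the failure" in the sense a confusion matrix would record. Only one of them detects it in time to matter.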

Diagnosis goes one step further. It asks what kind of failure occurred: semantic misgrounding, state drift, precondition violation, occlusion, unstable contact, or something else. This matters because different failures require different responses. A robot that misread the object needs a different correction from a robot that is applying too much force during insertion. Treating every warning as “something went wrong” is operationally neat and practically useless.

After detection and diagnosis comes intervention. Shielding can block unsafe actions under modeled constraints. Runtime steering can bias the robot toward safer continuations. Human handoff can stop autonomy when uncertainty or precision demands exceed acceptable bounds. Interactive correction can repair a local failure without full takeover. Replanning can revise the active task sequence. Recovery systems can rewind, restore, or generate corrective actions.
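One way to avoid the generic "something went wrong" response is an explicit table from diagnosis to intervention. The sketch below uses the failure types and interventions named above, but the specific pairings are illustrative assumptions, not recommendations from the survey; the enum and function names are hypothetical.

```python
from enum import Enum


class Diagnosis(Enum):
    SEMANTIC_MISGROUNDING = "semantic misgrounding"
    STATE_DRIFT = "state drift"
    PRECONDITION_VIOLATION = "precondition violation"
    OCCLUSION = "occlusion"
    UNSTABLE_CONTACT = "unstable contact"
    UNKNOWN = "unknown"


class Intervention(Enum):
    SHIELD = "shield"
    STEER = "steer"
    HUMAN_HANDOFF = "human handoff"
    INTERACTIVE_CORRECTION = "interactive correction"
    REPLAN = "replan"
    RECOVER = "recover"


# Illustrative pairing only; a real system would justify each row with evidence.
RESPONSE_TABLE = {
    Diagnosis.SEMANTIC_MISGROUNDING: Intervention.INTERACTIVE_CORRECTION,
    Diagnosis.STATE_DRIFT: Intervention.REPLAN,
    Diagnosis.PRECONDITION_VIOLATION: Intervention.RECOVER,
    Diagnosis.OCCLUSION: Intervention.STEER,
    Diagnosis.UNSTABLE_CONTACT: Intervention.SHIELD,
}


def choose_intervention(diagnosis):
    """Look up the response; anything undiagnosed falls back to human handoff."""
    return RESPONSE_TABLE.get(diagnosis, Intervention.HUMAN_HANDOFF)


print(choose_intervention(Diagnosis.UNSTABLE_CONTACT).value)  # shield
print(choose_intervention(Diagnosis.UNKNOWN).value)           # human handoff
```

The conservative default is the design point: a diagnosis the table does not cover ends autonomy rather than continuing on a guess.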

Execution-time safety is therefore the resilience layer. It accepts that upstream safety will be incomplete. This is not pessimism. It is engineering.

The final execution category is contact. The paper treats contact-rich interaction as its own safety regime, and rightly so. A robot can follow the right plan, choose the right object, and move toward the right pose while still being unsafe because it pushes too hard, reacts too late, jams a part, loses tactile information, or fails to modulate compliance.

This is where physical safety becomes less philosophical. Force, torque, slip, contact duration, impulse, pressure, and stability are not vibes. They are measurements. The survey reviews adaptive compliance, force-aware learning, tactile and force-tokenized foundation models, control-theoretic constraints, and hierarchical refinement. It also emphasizes the gap: much contact-rich evidence remains empirical. Formal guarantees for high-dimensional, long-horizon, contact-rich manipulation remain thin.
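Because these quantities are measurements, they can be summarized directly from a logged trace. The following sketch assumes a uniformly sampled force signal in newtons and a single fixed force limit; the function name, the limit, and the trace values are hypothetical, and real contact limits depend on the task, the object, and the sensor.

```python
def contact_summary(forces_newton, dt_seconds, limit_newton):
    """Summarize a uniformly sampled force trace against a fixed limit."""
    if not forces_newton:
        raise ValueError("force trace is empty")
    over = [force > limit_newton for force in forces_newton]
    longest = run = 0
    for flag in over:
        run = run + 1 if flag else 0  # length of the current violation run
        longest = max(longest, run)
    return {
        "peak_force_n": max(forces_newton),
        "samples_over_limit": sum(over),
        "time_over_limit_s": sum(over) * dt_seconds,
        "longest_violation_s": longest * dt_seconds,
    }


# Hypothetical insertion trace sampled every 10 ms with a 10 N limit.
trace = [2.0, 4.0, 9.0, 12.0, 11.0, 5.0, 3.0]
summary = contact_summary(trace, dt_seconds=0.01, limit_newton=10.0)
print(summary)
```

A rollout with this trace could still be scored as a successful insertion. The summary is what records that the limit was exceeded for two consecutive samples along the way.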

In business terms, execution-time safety is the layer that decides whether a robot should continue, ask, stop, hand off, soften, recover, or abort. That decision cannot be outsourced to final task success. The final task success metric is not there when the gripper is already squeezing too hard.

The fourth category: evaluation must remember the route, not just admire the destination

The survey’s benchmark discussion is where the misconception becomes most visible. Many manipulation benchmarks evaluate whether a robot completes a task, reaches a goal state, or progresses through a sequence. These capability benchmarks are necessary. They are also insufficient for safety.

The paper distinguishes several evidence objects:

Evidence object	What it can support	What it does not prove
Plan-level evidence	Unsafe intent was rejected; constraints or specifications were satisfied before rollout	The robot will execute safely under perception error, policy drift, or contact uncertainty
Policy-level evidence	A constrained or aligned policy reduced unsafe behavior in evaluated rollouts	Risk onset, intervention timing, and physical severity are controlled
Runtime-level evidence	The robot can detect, diagnose, intervene, or recover during execution	Contact forces or physical hazards were strictly bounded
Contact-level evidence	Physical interaction variables such as force, slip, pressure, or duration were measured	The whole long-horizon procedure is safe across semantic, procedural, and operational layers

This table is the practical heart of the paper. Safety evidence is not interchangeable.

A safe high-level plan does not guarantee safe contact. A policy with low aggregate safety cost may still produce short dangerous spikes. A successful recovery may pass through unsafe intermediate states. A contact-aware controller may regulate force while the robot pursues the wrong task. Each evidence type is useful. None is sovereign.
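The point about aggregate cost and short spikes is easy to show with two made-up rollouts. The per-step cost values and the state-wise limit below are hypothetical numbers chosen for illustration, not data from any benchmark.

```python
def mean_cost(costs):
    """Average per-step safety cost over a rollout."""
    return sum(costs) / len(costs)


def violates_statewise(costs, limit):
    """True if any single step exceeds the per-step limit."""
    return any(cost > limit for cost in costs)


# Two hypothetical 10-step rollouts with integer per-step costs.
steady = [2] * 10        # constant low cost
spiky = [0] * 9 + [15]   # clean until one sharp spike

LIMIT = 10  # assumed per-step limit

print(mean_cost(steady), violates_statewise(steady, LIMIT))  # 2.0 False
print(mean_cost(spiky), violates_statewise(spiky, LIMIT))    # 1.5 True
```

Ranked by average cost, the spiky rollout looks safer. Ranked by whether any step crossed the limit, it is the only one that failed.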

The survey also notes that newer benchmarks are moving beyond binary success. Some report progress scores, stagewise progression, collision events, object slips, smoothness proxies, failure annotations, early-warning metrics, intervention rates, recovery success, and contact-force measures. These are steps toward procedural safety observability: recording how risk appears, evolves, is detected, and is mitigated across the rollout.

That phrase deserves attention. Procedural safety observability means the robot’s safety history becomes inspectable. Not just whether the task ended well, but whether it was safe to continue at each important point.

For deployed robotics, this is not academic neatness. It is the difference between being able to audit an incident and merely replaying a video while everyone guesses which layer lied first.

A useful business translation: safety claims need labels

The paper is a survey, not a certification standard. It does not rank vendors, validate a product, or tell warehouse operators which robot to buy next quarter. Its business value is instead diagnostic. It gives companies a way to label safety claims before those claims harden into procurement language.

A practical assurance checklist might look like this:

Deployment question	Required safety evidence	Evidence boundary
Does the robot understand the task and constraints?	Grounding checks, specification translation, unsafe-plan rejection	Does not prove safe motion or contact
Can the robot produce valid subtask sequences?	Precondition/postcondition validation, temporal logic checks, task decomposition tests	Depends on completeness of predicates and world model
Are action proposals constrained before execution?	Constrained decoding, safety layers, safe RL, policy filters	Usually bounded by specification quality and training distribution
Can the robot detect drift during rollout?	Runtime anomaly detection, calibrated uncertainty, failure diagnosis	Detection is not intervention; false positives and latency matter
Can the robot recover without creating new hazards?	Replanning tests, recovery traces, intervention logs, restoration efficiency	Binary recovery success may hide unsafe intermediate behavior
Is contact physically controlled?	Force, torque, slip, pressure, contact duration, threshold violation metrics	Sensor quality, embodiment, and task regime limit generalization
Are near misses and unsafe successes recorded?	Procedural safety logs and incident-learning records	Requires organizational discipline, not just model performance

For robotics buyers, this shifts the conversation from “What is the success rate?” to “Success under which safety evidence?” For robotics vendors, it prevents a common strategic mistake: selling task completion as if it were safety assurance. That mistake may work in a demo. It ages poorly in deployment.

For internal governance teams, the paper suggests a more precise documentation structure. A robot safety file should not merely list test scores. It should map claims to layers:

Planning-time claim: what was specified, grounded, verified, and rejected before rollout.
Policy-time claim: what constraints, objectives, filters, or alignment mechanisms shaped action proposals.
Execution-time claim: what monitors, interventions, handoffs, recovery procedures, and contact regulators operated during rollout.
Evidence boundary: whether the claim is formal, statistical, or empirical, and under what assumptions.
Revalidation trigger: when changes in task, workspace, sensor, gripper, policy, object class, or human proximity invalidate the old evidence.
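The five-part structure above maps naturally onto a small record type. This is a sketch under stated assumptions: the class names, field names, and example claims are hypothetical, and a real safety file would carry far more detail than a one-line statement.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    PLANNING = "planning-time"
    POLICY = "policy-time"
    EXECUTION = "execution-time"


class EvidenceKind(Enum):
    FORMAL = "formal"
    STATISTICAL = "statistical"
    EMPIRICAL = "empirical"


@dataclass(frozen=True)
class SafetyClaim:
    layer: Layer
    statement: str
    evidence: EvidenceKind
    assumptions: tuple
    revalidate_on: frozenset  # deployment attributes whose change voids the evidence


def claims_needing_revalidation(claims, changed):
    """Return the claims whose triggers overlap the changed attributes."""
    changed = set(changed)
    return [claim for claim in claims if claim.revalidate_on & changed]


claims = [
    SafetyClaim(Layer.PLANNING, "Unsafe plans are rejected before rollout",
                EvidenceKind.FORMAL, ("predicates are complete",),
                frozenset({"task", "object_class"})),
    SafetyClaim(Layer.EXECUTION, "Insertion force stays under the limit",
                EvidenceKind.EMPIRICAL, ("same gripper and sensor",),
                frozenset({"gripper", "sensor"})),
]

stale = claims_needing_revalidation(claims, changed={"gripper"})
print([claim.statement for claim in stale])
```

Swapping the gripper flags the contact claim and leaves the planning claim alone, which is the behavior a revalidation trigger is supposed to have.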

This is not glamorous. Neither is insurance. Both become interesting after the first expensive failure.

The paper’s strongest warning is about evidence transfer

The most useful idea in the survey is evidence boundary. The authors separate formal guarantees, statistical safety evidence, and empirical safety evidence.

Formal evidence can prove safety or correctness within explicit assumptions, such as temporal logic verification, reachability, or control barrier functions. Its strength is rigor. Its weakness is the abstraction gap. The real world has a rude habit of not reading your model assumptions.

Statistical evidence can support probabilistic claims, such as calibrated failure detection or confidence-bounded intervention. Its strength is uncertainty awareness. Its weakness is regime dependence. Calibration can decay when the task, embodiment, sensor configuration, or environment changes.

Empirical evidence observes behavior in simulations, datasets, benchmarks, and case studies. Its strength is practical relevance. Its weakness is coverage. It cannot guarantee long-tail safety merely because the evaluated cases behaved nicely.
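A standard piece of statistics, not something taken from the survey, shows how little a clean test campaign can claim. If a robot completes n independent trials with zero observed failures, the one-sided upper confidence bound on its true failure probability is 1 - (1 - confidence)^(1/n), which is roughly 3/n at 95% confidence. The sketch assumes the trials are independent and drawn from the same regime the robot will be deployed in, which is exactly the assumption that reality shift breaks.

```python
def zero_failure_upper_bound(trials, confidence=0.95):
    """Upper bound on failure probability after `trials` runs with no failures.

    Assumes independent trials from a single, unchanged operating regime.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)


bound_100 = zero_failure_upper_bound(100)
bound_1000 = zero_failure_upper_bound(1000)
print(f"{bound_100:.4f}")
print(f"{bound_1000:.4f}")
```

One hundred clean rollouts still leave room for a failure rate of about 3%, and only within the tested regime. That is useful evidence. It is not a guarantee about the long tail.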

The danger is not using weak evidence. Weak evidence is often all a field has at early stages. The danger is promoting weak evidence into a stronger claim than it supports.

The paper repeatedly warns against this promotion. Robustness is not automatically safety. Capability is not automatically safety. Alignment is not automatically safety. A foundation model’s better task performance does not become safety evidence unless the metric is explicitly tied to hazard reduction, unsafe-state avoidance, recoverability, intervention quality, or physical risk.

This is the kind of sentence that should be printed on procurement templates, preferably near the signature line.

What the paper directly shows, what Cognaptus infers, and what remains uncertain

The paper directly shows a fragmented research landscape. Planning-time, policy-time, execution-time, benchmark, and contact-safety research are all advancing, but they often produce layer-local claims. The survey’s contribution is to organize those claims by intervention locus and evidence boundary, then show why they do not automatically compose into end-to-end safety assurance.

Cognaptus infers a governance principle from this structure: embodied AI deployment should be managed as an evidence-composition problem. A robot should not be described as “safe” in general. It should be described as safe with respect to a defined task class, workspace, embodiment, sensor stack, contact regime, intervention protocol, and evidence level. Anything broader is not necessarily false. It is just under-specified, which in physical systems is a fancy way of saying “waiting for a liability event.”

What remains uncertain is substantial. The survey does not provide a universal metric for safe long-horizon manipulation. It does not resolve how to combine formal, statistical, and empirical claims into a single assurance case. It does not show that current VLA systems can be made safe through one preferred architecture. It does not eliminate the reality gap between simulation and hardware, or the embodiment gap between robot platforms. It also does not settle how much human intervention is operationally acceptable before “autonomous robot” becomes “expensive remote-control intern.”

Those boundaries do not weaken the paper. They clarify what kind of paper it is. It is a map of evidence discipline, not a destination.

The deployment lesson: build the safety record before the robot needs an excuse

The survey’s future directions are unusually useful for business readers because they translate research gaps into organizational requirements.

First, safety information must survive abstraction. A phrase like “insert gently” should not disappear as the task moves from language to plan to trajectory. It should become contact-phase recognition, force limits, compliance parameters, and runtime checks. A phrase like “avoid the red region” should preserve symbolic region identity, geometric margins, and trajectory-level exclusion.

Second, safety evidence must survive reality shift. A detector calibrated on one robot, gripper, sensor set, object class, or workspace may not remain valid elsewhere. Companies love platform reuse. Safety evidence loves context. This relationship will be tense.

Third, systems need clearer mappings from signals to interventions. Not every uncertainty signal should trigger a full stop. Not every anomaly needs human handoff. Not every contact deviation requires replanning. A mature system should know when to ask, when to halt, when to shield, when to recover, and when to give the human operator the wheel.

Fourth, benchmarks need procedural observability. A rollout record should distinguish safe success, unsafe success, safe failure, and unsafe failure. That distinction is painfully important. A robot that fails early, stops safely, and preserves recoverability may be safer than one that succeeds after a near-miss ballet nobody logged.

Finally, deployment assurance must become a living argument. The paper points toward documentation, incident learning, and robot-specific safety records: what hazards were tested, which near misses occurred, which interventions worked, which contact limits were respected, and which assumptions were invalidated after deployment.

The practical implication is blunt. If a company cannot reconstruct how a robot handled risk during a task, it does not have a safety case. It has a highlight reel.
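The four-way distinction between safe success, unsafe success, safe failure, and unsafe failure needs only two logged fields per rollout. The sketch below is a minimal illustration: the `Rollout` record, the idea of counting logged safety events, and the sample data are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    task_succeeded: bool
    safety_events: int  # logged near misses, limit violations, and similar


def label(rollout):
    """Classify a rollout on both axes instead of success alone."""
    safety = "safe" if rollout.safety_events == 0 else "unsafe"
    outcome = "success" if rollout.task_succeeded else "failure"
    return f"{safety} {outcome}"


def summarize(rollouts):
    """Count rollouts per four-way label."""
    return Counter(label(rollout) for rollout in rollouts)


rollouts = [
    Rollout(True, 0),
    Rollout(True, 2),   # finished the task after two logged near misses
    Rollout(False, 0),  # stopped early without incident
    Rollout(False, 1),
    Rollout(True, 0),
]

success_rate = sum(r.task_succeeded for r in rollouts) / len(rollouts)
counts = summarize(rollouts)
print(success_rate)
print(dict(counts))
```

The headline number is a 60% success rate. The four-way count shows that one of those three successes was unsafe, which is the row a highlight reel leaves out.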

Conclusion: success is an outcome; safety is a trace

The paper’s main contribution is not a new robot or a new benchmark score. It is a disciplined vocabulary for saying what kind of safety evidence exists, where it enters the system, and where it stops being valid.

That vocabulary matters because long-horizon manipulation is full of seductive false comfort. A robot may plan coherently, act fluently, recover impressively, and still be unsafe in the procedural details that a final success metric ignores. Conversely, a robot may fail a task while demonstrating exactly the kind of safe hesitation, handoff, or recovery behavior that deployment teams should want.

The business lesson is therefore not “wait until robotics safety is solved.” That would be convenient, and also a fine way to wait forever. The lesson is to stop treating safety as a single property attached to a model or robot. Treat it as a layered evidence record.

Ask where the safety mechanism acts. Ask which hazard it covers. Ask whether the evidence is formal, statistical, or empirical. Ask what changes invalidate the claim. Ask whether the rollout trace can distinguish safe success from unsafe success.

The robot finishing the task is good news. It is not the audit.

Cognaptus: Automate the Present, Incubate the Future.

¹ Dabin Kim, Daemin Park, Sangyub Lee, Jinsik Kim, Yeongtak Oh, Jongho Shin, and Sungroh Yoon, “Safe Embodied AI for Long-horizon Tasks: A Cross-layer Analysis of Robotic Manipulation,” arXiv:2606.05660, 2026.
