Robots are very good at looking confident.

Put a camera on a robot arm, train it with enough demonstrations, and it may glide toward a box, a switch, or a tool with the calm precision of something that understands the world. Then contact happens. The fingertip presses too hard. The switch has not actually toggled. The object slips, bends, jams, or quietly enters the expensive category known as “damaged inventory.”

That is the awkward part of manipulation: vision can tell a robot where the world is, but force tells it what the world is doing back.

The paper behind ImplicitRDP addresses exactly this problem.[1] Its central claim is not simply that robots should use force sensors. That part is obvious, and obvious things are where many robotics demos go to die. The real problem is subtler: adding force readings to a visual imitation-learning policy does not automatically make the policy reactive. If the policy predicts a chunk of future actions and then executes that chunk mostly open-loop, high-frequency force feedback may arrive technically, numerically, and spiritually too late.

ImplicitRDP matters because it tries to remove that gap. It builds a single end-to-end visual-force diffusion policy that keeps slow visual planning and fast force reaction inside one causally structured model. No separate slow planner passing a compressed latent instruction to a fast controller. No polite handover between modules where geometry quietly disappears. The robot sees, feels, and updates its action sequence inside the same policy.

That sounds like architecture plumbing. It is actually the story.

The old split: vision plans, force cleans up the mess

Modern imitation-learning systems often use action chunking. Instead of predicting one tiny motor command at a time, the policy predicts a short sequence of actions. This gives smoother behavior and lets the model capture longer motion patterns. In diffusion policies, the model learns to generate those action sequences by denoising candidate trajectories conditioned on observations.

For vision-based manipulation, this is powerful. A wrist camera or external camera gives spatial context. The model learns that the gripper should move here, rotate there, and follow a trajectory that looks like successful demonstrations.

But contact-rich manipulation is not just geometry. It is negotiation.

A box may require gentle sustained pressure. A switch may require a short force burst at the right location. A wiping motion may need continuous adjustment as resistance changes. In these cases, the most important signal may not be a new image frame; it may be a force spike or missing force response that appears between action decisions.

Classic chunked policies have a problem here. Within a chunk, the policy is often effectively open-loop. Force can be included in the input at the beginning of the prediction window, but the policy does not necessarily revise each action as force changes. So the model has a sensor channel called “force,” while its behavior still says, “Thank you, I will ignore that until the next planning cycle.”
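The open-loop problem is easy to reproduce in a toy setting. The sketch below is not the paper's policy or task: it is a made-up 1-D fingertip pushing into a spring-like wall, and `contact_force`, `peak_force`, and every number in it are assumptions chosen for illustration. The only thing that changes between the two runs is how often the policy is allowed to look at force.

```python
def contact_force(x, wall=0.0, stiffness=50.0):
    """Spring-like contact: no force before the wall, linear force past it."""
    return stiffness * max(0.0, x - wall)


def peak_force(replan_every, steps=24, target_force=4.0, step_size=0.05, start=-0.2):
    """Advance a 1-D fingertip toward a wall and return the largest force seen.

    The policy reads force only every `replan_every` steps and repeats its
    last decision (advance or hold) in between.
    """
    x, peak, advance = start, 0.0, True
    for t in range(steps):
        force = contact_force(x)
        peak = max(peak, force)
        if t % replan_every == 0:          # force is only consulted here
            advance = force < target_force
        if advance:
            x += step_size
    return max(peak, contact_force(x))


open_loop = peak_force(replan_every=8)    # force read once per 8-step chunk
closed_loop = peak_force(replan_every=1)  # force read at every step
print(f"peak force, chunked open-loop: {open_loop:.1f}")
print(f"peak force, per-step reaction: {closed_loop:.1f}")
```

With these invented numbers the chunked policy overshoots to about twice the peak force of the per-step one, even though both have the same force sensor. The sensor is not the difference. The permission to use it is.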

A previous answer to this problem is hierarchical slow-fast control. Reactive Diffusion Policy, the baseline discussed in the paper, separates the system into a slow visual policy and a fast force-reactive policy. The slow part handles visual planning. The fast part reacts to force.

That division is intuitive. It is also where the paper finds trouble.

The fast controller may not have enough spatial context. The slow planner compresses visual and action information into a latent representation, and the fast policy must decode that into real actions using current force. If the compressed plan loses geometry, the reactive layer can press with enthusiasm in the wrong place. Robotics has many ways to fail; “confidently reactive but spatially confused” is one of the less glamorous ones.

ImplicitRDP’s design starts from a different assumption: slow and fast signals should not be assigned to separate modules too early. They should be handled as different-frequency tokens inside one causal sequence model.

Structural slow-fast learning makes reaction possible inside an action chunk

The first contribution is Structural Slow-Fast Learning, or SSL. The name is not poetic, but the mechanism is useful.

ImplicitRDP separates observations into two groups. Vision and proprioception are treated as slow observations: rich in spatial context, but relatively low-frequency. Force signals are treated as fast observations: narrow, local, noisy, and rapidly changing. Instead of sending them through separate policies, the model places them into a unified Transformer-based diffusion policy.

The critical part is temporal causality.

Force tokens are encoded as a sequence aligned with the action chunk. A GRU preserves causal force encoding, and a causal attention mask prevents an action prediction from attending to future force readings. In plain terms, when the model predicts what to do now, it can use force feedback from the past and present, not magical information from the future. This matters because training can otherwise cheat. A model trained with future force leakage may look excellent offline and then become impressively useless on the robot.
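The masking rule can be written down in a few lines. This is a minimal sketch under an assumed token layout (a block of slow tokens followed by one force token per action step); the paper's actual layout and tensor shapes may differ, and `build_action_mask` is a hypothetical helper, not a function from the authors' code.

```python
def build_action_mask(num_slow, horizon):
    """mask[i][j] is True when action token i may attend to context token j.

    Assumed context layout: `num_slow` slow tokens (vision, proprioception)
    followed by `horizon` force tokens, one per action step.
    """
    mask = []
    for i in range(horizon):
        slow_part = [True] * num_slow                   # always visible
        force_part = [j <= i for j in range(horizon)]   # past and present only
        mask.append(slow_part + force_part)
    return mask


for row in build_action_mask(num_slow=2, horizon=4):
    print("".join("x" if allowed else "." for allowed in row))
```

The printed pattern is a staircase: every action step sees all slow tokens, and each later step sees one more force token than the step before it. A unidirectional recurrent encoder such as a GRU has the same property by construction, since its state at step i only depends on inputs up to step i.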

The inference mechanism is equally important. Diffusion models are stochastic by default. Stochasticity is charming in image generation. It is less charming when a robot fingertip is pressed against a fragile object.

ImplicitRDP uses a deterministic DDIM sampler and caches the slow visual context plus the initial noise at the beginning of an action chunk. Then, at each fast control step, it updates the force context, runs the denoising process consistently, and executes the latest action. The action sequence remains coherent, but force feedback can still influence behavior within the chunk.
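A stripped-down version of that inference loop looks roughly like this. It is a sketch under loud assumptions: `make_toy_denoiser` is a hypothetical stand-in for the trained network, the `alpha_bars` schedule, `gain`, and `trust` values are invented, and actions are scalars. What it does implement faithfully is the deterministic DDIM update (eta = 0) and the idea of drawing noise once per chunk and reusing it at every fast step.

```python
import math
import random


def ddim_sample(noise, denoise_x0, alpha_bars):
    """Deterministic DDIM (eta = 0): no randomness after the initial noise."""
    x = list(noise)
    for t in range(len(alpha_bars) - 1, -1, -1):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
        x0 = denoise_x0(x, t)
        eps = [(xi - math.sqrt(ab_t) * x0i) / math.sqrt(1.0 - ab_t)
               for xi, x0i in zip(x, x0)]
        x = [math.sqrt(ab_prev) * x0i + math.sqrt(1.0 - ab_prev) * ei
             for x0i, ei in zip(x0, eps)]
    return x


def make_toy_denoiser(slow_plan, force_so_far, gain=0.02, trust=0.8):
    """Hypothetical network stand-in: step i only uses force readings <= i."""
    horizon = len(slow_plan)
    # Steps whose force has not been measured yet reuse the latest reading.
    padded = force_so_far + [force_so_far[-1]] * (horizon - len(force_so_far))
    target = [p - gain * f for p, f in zip(slow_plan, padded)]

    def denoise_x0(x, t):
        return [trust * tg + (1.0 - trust) * xi for tg, xi in zip(target, x)]

    return denoise_x0


def run_chunk(slow_plan, force_stream, seed=0):
    """Execute one chunk, re-denoising at every fast step with cached noise."""
    rng = random.Random(seed)
    cached_noise = [rng.gauss(0.0, 1.0) for _ in slow_plan]  # drawn once
    alpha_bars = [0.9, 0.6, 0.3, 0.1]                        # assumed schedule
    executed, forces, actions = [], [], []
    for k in range(len(slow_plan)):
        forces.append(force_stream[k])                       # newest reading
        denoiser = make_toy_denoiser(slow_plan, forces)
        actions = ddim_sample(cached_noise, denoiser, alpha_bars)
        executed.append(actions[k])                          # act on step k only
    return executed, actions


plan = [0.1 * i for i in range(6)]
executed, final_chunk = run_chunk(plan, [0.0, 0.0, 2.0, 6.0, 6.0, 3.0])
print([round(a, 3) for a in executed])
```

Because the noise is cached and the sampler is deterministic, re-denoising at a later step reproduces the actions that were already executed, so the chunk does not contradict itself halfway through. Only the steps whose force context changed move. Swap in fresh noise at every step and that guarantee disappears.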

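So the model keeps two benefits that are usually in tension, smooth chunked action and high-frequency reaction, along with the spatial awareness and inference consistency that make the pairing usable: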

| Requirement | Why it matters in contact-rich manipulation | How ImplicitRDP handles it |
| --- | --- | --- |
| Smooth chunked action | Avoids jitter and preserves non-Markovian motion patterns | Uses diffusion-style action chunk modeling |
| High-frequency reaction | Adjusts pressure, contact timing, and local behavior during execution | Injects causal force tokens at fast control steps |
| Spatial awareness during contact | Prevents force reaction from losing geometry | Lets action tokens attend to both visual and force tokens |
| Consistent inference | Avoids stochastic re-planning chaos inside one chunk | Reuses cached slow context and deterministic DDIM noise |

This is the real mechanism. ImplicitRDP is not “vision plus force.” It is “vision plus force with causal permission to matter.”

The virtual target turns force into an action-language signal

The second contribution addresses a problem common in multimodal learning: modality collapse.

End-to-end models are practical creatures. If one input channel gives an easier path to minimizing loss, the model may overuse it and quietly underuse the others. In contact-rich manipulation, force can be especially easy to mishandle. It is noisy in free space, task-critical during contact, and often measured in a different coordinate frame from the action trajectory.

The paper’s solution is Virtual-target-based Representation Regularization, or VRR.

Instead of asking the network to predict raw future force, ImplicitRDP asks it to predict a “virtual target” derived from compliance control. A compliance system can be thought of as relating external force, current pose, and a desired virtual pose through stiffness. Under the quasi-static simplification used in the paper, force feedback can be mapped into a virtual target in Cartesian action space.

That mapping matters for two reasons.

First, the virtual target lives in the same space as the robot’s action. Raw force may be measured in the tool-center-point frame, while actions are trajectories in the robot base or world frame. Predicting raw force asks the model to learn a physically meaningful but representation-mismatched auxiliary task. Predicting a virtual target makes the auxiliary task closer to the thing the policy actually controls.

Second, the adaptive stiffness mechanism changes how much the auxiliary signal matters across phases. In free motion, measured force is often small and may mostly reflect sensor noise. The method assigns high stiffness, so the virtual target remains close to the actual pose and the auxiliary signal stays quiet. During contact, force becomes large, stiffness in the force direction becomes lower, and the virtual target deviates more from the actual pose. That makes contact events more visible to the loss.

A simple way to read this is: the model is not rewarded equally for caring about force everywhere. It is pushed to care about force when force means contact, not when force means measurement noise having a little performance art moment.
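The virtual-target idea can be sketched in a few lines of arithmetic. Everything specific here is an assumption rather than the paper's formulation: the sign convention (target = pose + force / stiffness), the linear stiffness schedule, the thresholds, and the helper names `adaptive_stiffness` and `virtual_target` are all hypothetical. The sketch also covers position only and ignores torque and orientation.

```python
import math


def rotate(R, v):
    """Apply a 3x3 rotation matrix (nested lists) to a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def adaptive_stiffness(force_norm, k_free=5000.0, k_contact=500.0,
                       f_low=1.0, f_high=10.0):
    """Assumed schedule: stiff (N/m) for small forces, softer as force grows."""
    if force_norm <= f_low:
        return k_free
    if force_norm >= f_high:
        return k_contact
    blend = (force_norm - f_low) / (f_high - f_low)
    return k_free + blend * (k_contact - k_free)


def virtual_target(position, force_tcp, R_base_tcp):
    """Quasi-static compliance: shift the pose along the measured force.

    Force is first rotated from the tool frame into the base frame so that
    the target lives in the same frame as the action. The sign depends on
    how the sensor reports force; it is an assumption here.
    """
    force_base = rotate(R_base_tcp, force_tcp)
    norm = math.sqrt(sum(c * c for c in force_base))
    k = adaptive_stiffness(norm)
    return [p + f / k for p, f in zip(position, force_base)]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pose = [0.40, 0.00, 0.20]
free_space = virtual_target(pose, [0.0, 0.0, 0.5], identity)   # sensor noise
in_contact = virtual_target(pose, [0.0, 0.0, 10.0], identity)  # firm contact
print("free-space offset (m):", round(free_space[2] - pose[2], 5))
print("contact offset (m):   ", round(in_contact[2] - pose[2], 5))
```

With these invented values, half a newton of noise moves the target by a tenth of a millimetre, while ten newtons of contact moves it by two centimetres. The auxiliary loss therefore stays quiet in free motion and gets loud at contact, which is the behaviour the regularizer is after.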

| Auxiliary objective | What it teaches the model | Main weakness | Paper’s interpretation |
| --- | --- | --- | --- |
| No auxiliary task | Learn action from available observations | Force may be ignored | Modality collapse appears more likely |
| Raw force prediction | Learn physical feedback directly | Frame mismatch and noisy free-space signal | Helps somewhat, but not enough |
| Virtual target prediction | Translate force into action-space correction | Requires compliance-inspired design choices | Best-performing auxiliary task in the experiments |

This is where the paper becomes more interesting for business readers. The point is not that auxiliary losses are good. That sentence has been true for long enough to need retirement. The point is that the auxiliary target must be operationally aligned with the decision the system needs to make.

For robot manipulation, “what force will I feel?” is less directly useful than “how should my target motion shift because of this force?”

The main evidence: two contact tasks, two different force regimes

The experiments use two real-world contact-rich manipulation tasks: box flipping and switch toggling. Each task uses 40 demonstrations. The setup includes a Flexiv Rizon 4s robot arm, a 6-axis force/torque sensor, joint torque sensors for kinesthetic teaching, a wrist camera, and custom compliant fingertips. Data is recorded at 10 Hz.

The two tasks are deliberately different.

Box flipping requires sustained force. The robot must push a thin phone box against a fixture to flip it upright, while avoiding excessive force. This resembles tasks where pressure must be maintained without brute force: peeling, wiping, or handling delicate items.

Switch toggling requires a short-duration force burst. The robot must locate and actuate a circuit breaker switch. The important state is not visually obvious: a vision-only model may not know whether the triggering force has been reached.

The headline results are strong:

| Method | Box flipping | Switch toggling |
| --- | --- | --- |
| Vision-only Diffusion Policy | 0 / 20 | 8 / 20 |
| Hierarchical Reactive Diffusion Policy | 16 / 20 | 10 / 20 |
| ImplicitRDP | 18 / 20 | 18 / 20 |

The interpretation should be task-specific.

On box flipping, the vision-only diffusion policy fails completely in the reported trials. The paper’s failure analysis says it tends to apply excessive force. That is exactly the failure one would expect when vision is asked to infer a hidden contact condition. The model sees the box. It does not reliably feel whether the pressure is already enough.

The hierarchical RDP baseline does much better on box flipping: 16 out of 20. This matters. It means the older slow-fast idea is not silly. Giving force a reactive role helps. But ImplicitRDP still improves the result to 18 out of 20 while using a unified end-to-end policy.

Switch toggling is where the distinction becomes sharper. Vision-only DP reaches 8 out of 20. Hierarchical RDP reaches 10 out of 20. ImplicitRDP reaches 18 out of 20. The paper attributes RDP’s difficulty partly to contact-location errors during approach, likely caused by precision loss from compressing actions into a latent space before the fast policy decodes them.

That is the key point: reaction without geometry is still brittle. Force does not merely need to be fast. It needs to remain connected to spatial context.

The ablations separate the mechanism from the decoration

The paper includes several tests beyond the headline baseline comparison. These are easy to summarize badly, so they deserve a more disciplined reading.

| Test | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| DP vs RDP vs ImplicitRDP | Main evidence against prior baselines | Unified visual-force causal modeling performs better on the two tested tasks | General dominance across all manipulation tasks |
| Removing SSL and VRR | Ablation of the core mechanism | Open-loop force input is not enough | That SSL alone explains every gain |
| Removing SSL while keeping VRR | Ablation of closed-loop force control | VRR cannot fully compensate for lack of fast closed-loop reaction | That VRR is unimportant |
| No auxiliary vs force prediction vs virtual target | Auxiliary-task comparison | Virtual target is more useful than raw force prediction in these tasks | That virtual targets are universally optimal |
| Prediction and rotation choices | Implementation stability ablation | Velocity prediction and Euler-angle representation improved stability here | That these are always best choices in all robot settings |

The SSL ablation is especially informative.

When both SSL and VRR are removed, the model reaches 6/20 on box flipping and 5/20 on switch toggling. That version still has force input, but not the closed-loop causal mechanism or virtual-target regularization. In other words, force as a feature does not magically become force as control.

When VRR is added but SSL is still removed, box flipping falls to 4/20 while switch toggling rises to 15/20. That split is revealing. For switch toggling, a short force event may benefit from better representation even without full closed-loop force maintenance. For box flipping, sustained pressure is less forgiving. If the robot cannot adjust force continuously inside the chunk, it drifts into failure.

The auxiliary-task comparison is also clean:

| Auxiliary task | Box flipping | Switch toggling |
| --- | --- | --- |
| None | 6 / 20 | 6 / 20 |
| Force prediction | 8 / 20 | 10 / 20 |
| Virtual target prediction | 18 / 20 | 18 / 20 |

This table supports the paper’s claim that the virtual target is not just a prettier label for force prediction. Raw force prediction helps, but the virtual target helps much more in the tested setting. The likely reason is not mystical physics branding. It is representation alignment: the auxiliary target is closer to the action correction the policy must learn.

The learning-stability ablation belongs in a different category. The paper reports that velocity-prediction outperforms noise-prediction and sample-prediction, and that Euler angles outperform a 6D rotation representation in the tested conditions. The full model reaches 18/20 on both tasks; the 6D rotation variant, for example, reaches 16/20 on box flipping but 12/20 on switch toggling.

This is not a second thesis about rotation representations. It is an implementation warning. Contact-rich policies can become unstable when high-frequency force noise meets a powerful Transformer diffusion model. The model architecture may be elegant, but if the action representation jitters, the robot will still behave like it drank three espressos and found a circuit breaker.
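For readers who have not met the three prediction targets compared in that stability ablation, here is how they relate. This assumes "velocity-prediction" refers to the standard v-parameterization used in diffusion models, which the wording suggests but this article cannot confirm; the function names are hypothetical and the values are scalars for readability.

```python
import math


def diffusion_targets(x0, eps, alpha_bar):
    """Return the noisy sample and the three common regression targets."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t = a * x0 + s * eps          # forward noising
    v = a * eps - s * x0            # v-parameterization ("velocity")
    return x_t, {"noise": eps, "sample": x0, "velocity": v}


def recover_from_velocity(x_t, v, alpha_bar):
    """Invert the v-parameterization back to the clean sample and the noise."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return a * x_t - s * v, s * x_t + a * v


x_t, targets = diffusion_targets(x0=0.3, eps=-1.2, alpha_bar=0.25)
x0_back, eps_back = recover_from_velocity(x_t, targets["velocity"], 0.25)
print(targets)
print(round(x0_back, 6), round(eps_back, 6))
```

All three targets carry the same information, since any one of them plus the noisy sample determines the other two. What changes is how the regression error is weighted across noise levels, which is one plausible reason a choice that looks cosmetic can show up as jitter on a real arm.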

The hardware and controller details are not footnotes

One of the paper’s most practically important sections is easy to overlook: implementation details.

The authors do not just train a model and declare victory. They also design custom compliant fingertips. The reason is simple. If both the fingertip and object are rigid, force-induced action differences can be subtle and easily buried in sensor noise. Compliance makes contact produce clearer reactivity signals, giving the network better pairs of force feedback and action adjustment to learn from.

They also tune the low-level controller. Because ImplicitRDP is supposed to learn reactive behavior itself, the low-level controller must track commanded positions precisely rather than adding its own compliance. The paper modifies the robot’s default impedance controller by tuning integral gains in the Cartesian PI controller.

This is a boundary condition with business importance.

A manager reading the success-rate table might conclude that ImplicitRDP is a model upgrade. It is, but not only that. The demonstrated system includes model structure, auxiliary learning objective, compliant hardware, force sensing, and controller tuning. Remove those supporting pieces and the performance claim does not travel intact.

That does not weaken the paper. It makes it more useful. Real automation gains usually come from the interface between model, sensor, end-effector, and controller. Anyone selling “just add AI” to contact-rich robotics should be charged a small tax for every object their robot crushes.

What the paper directly shows

The direct evidence is narrow but meaningful.

The paper shows that, on two real-world contact-rich tasks, ImplicitRDP outperforms a vision-only diffusion policy and a hierarchical visual-force baseline. It also shows that open-loop force input is not enough, that virtual-target prediction outperforms raw force prediction as an auxiliary task in the tested setting, and that certain implementation choices materially affect stability.

The results are strongest where the task requires force to change behavior at the moment of contact. Box flipping tests sustained force regulation. Switch toggling tests whether the robot can locate contact and apply a short force burst without relying on visual state changes that may be too small to see.

The evidence does not show that ImplicitRDP solves general robotic manipulation. It shows that a unified causal visual-force diffusion policy can outperform both vision-only and explicit hierarchical slow-fast designs on two representative contact-rich tasks.

That is already enough.

What Cognaptus infers for business use

The business relevance is not “robots can now feel.” That is cute, and also too broad.

The more grounded inference is that contact-rich automation may benefit from policies that keep high-frequency feedback inside the learned action-generation loop, rather than treating sensor readings as passive context or routing them through a separate reactive layer.

This matters for several classes of work:

| Work category | Why force matters | What ImplicitRDP suggests |
| --- | --- | --- |
| Delicate packaging | Objects deform, slide, or crush before vision shows failure | Force-aware closed-loop chunks may reduce damage |
| Switches, buttons, and latches | Success depends on hidden actuation thresholds | Force can confirm whether contact has crossed the needed threshold |
| Wiping, polishing, peeling | Sustained pressure must stay within a useful range | Causal force feedback can adjust during execution, not after |
| Small-part assembly | Geometry and contact must remain coupled | Unified visual-force modeling may avoid latent handoff errors |
| Food and soft goods handling | Visual state is ambiguous and material response varies | Compliant sensing plus learned reaction may reduce manual tuning |

The ROI pathway is therefore not just higher task success. It may include fewer damaged items, smaller safety margins, less manual controller tuning across variants, and better generalization across objects whose contact behavior differs despite similar appearance.

But those are inferences, not direct measurements from the paper. The paper does not measure factory throughput, maintenance cost, deployment time, or labor substitution. It measures task success in a lab setup. A serious business reading should preserve that distinction.

Where the boundary really is

The limitations are not generic. They are concrete.

First, the experimental scope is small: two tasks, 40 demonstrations per task, and 20 evaluation trials per method-task pair. The tasks are well chosen for contact-rich behavior, but they do not cover the range of industrial manipulation.

Second, the setup is specialized. The system uses a specific robot arm, wrist camera, 6-axis force/torque sensing, custom compliant fingertips, and controller tuning. A company cannot assume the method will transfer unchanged to another robot, gripper, sensor layout, or compliance profile.

Third, the force data frequency is 10 Hz in the reported setup. That is sufficient for these tasks, but other contact-rich tasks may require faster tactile or force response. The paper itself points toward future integration with high-frequency tactile sensing.

Fourth, the virtual-target regularizer depends on compliance-inspired design choices such as stiffness assignment. The method is more physically aligned than raw force prediction, but it is not free of heuristics.

Finally, the method’s strongest claim is about contact-rich manipulation where force changes the correct action. For tasks dominated by long-horizon semantic planning, tool selection, or language-conditioned instruction following, ImplicitRDP is not the whole story. The authors note future work on extending the framework to Vision-Language-Action models. That extension is plausible, but it is not what this paper proves.
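The first limitation, 20 trials per method-task pair, can be made concrete with a little arithmetic. The snippet below is not from the paper: it applies a standard Wilson score interval to the headline success counts reported above, as a rough guide to how much a 20-trial success rate can move. The `wilson_interval` helper is written here for illustration.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Success counts out of 20 from the headline table: (box flipping, switch toggling)
results = {
    "Vision-only DP": (0, 8),
    "Hierarchical RDP": (16, 10),
    "ImplicitRDP": (18, 18),
}
for method, counts in results.items():
    for task, k in zip(("box flipping", "switch toggling"), counts):
        low, high = wilson_interval(k, 20)
        print(f"{method:18s} {task:16s} {k:2d}/20  95% CI [{low:.2f}, {high:.2f}]")
```

Read this way, the box-flipping gap between 18/20 and 16/20 sits well inside overlapping intervals, so it is suggestive rather than conclusive, while the gap between 18/20 and 0/20 is not in doubt. That is consistent with the article's reading: the evidence is meaningful, and it is also small.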

The useful lesson is structure, not sensor worship

The misconception to avoid is simple: force sensing is not the same as force control.

A robot can record force and still behave open-loop. A model can receive force tokens and still ignore them. A hierarchy can react quickly and still lose geometry. The paper’s contribution is to make force feedback causally available inside the action-generation process while keeping visual context attached.

That is the broader design lesson for applied AI systems: multimodal input is cheap; useful multimodal structure is not.

ImplicitRDP’s mechanism-first value is that it shows where the structure must sit. Not after the policy has already produced a chunk. Not in a separate fast module starved of geometry. Not as a raw auxiliary prediction target that lives in the wrong representation space. The structure must shape how observations, actions, and feedback interact during inference.

For robotics, that means a policy that can revise contact behavior while preserving coherent motion. For businesses, it points toward a more realistic path for automation in messy physical tasks: not humanoid theatre, not general-purpose magic, but better handling of the moment when the robot touches the world and the world refuses to be a JPEG.

Robots that only see can look competent. Robots that feel, interpret, and adjust may actually become useful.

There is a difference. The box, eventually, will notice.

Cognaptus: Automate the Present, Incubate the Future.


[1] Wendi Chen, Han Xue, Yi Wang, Fangyuan Zhou, Jun Lv, Yang Jin, Shirun Tang, Chuan Wen, and Cewu Lu, “ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning,” arXiv:2512.10946, 2025. https://arxiv.org/abs/2512.10946