Mind the Gap: When Robots Learn Social Norms the Human Way

A hotel robot does not need to understand the human soul. It does, however, need to stop cutting between two guests mid-conversation like an intern late for coffee.

That distinction matters. Most enterprise conversations about autonomous agents still treat navigation as a logistics problem: reach the destination, avoid collision, minimise delay. Very tidy. Very spreadsheet. Also incomplete. In public-facing environments, a robot can be technically safe and still socially unpleasant. It can avoid hitting people while still making them step back, tense up, or wonder why the expensive machine has the spatial awareness of a supermarket trolley.

The arXiv paper “RLSLM: A Hybrid Reinforcement Learning Framework Aligning Rule-Based Social Locomotion Model with Human Social Norms” tackles exactly this awkward middle zone: not collision avoidance, but social acceptability.¹ Its proposed system, RLSLM, combines reinforcement learning with a rule-based Social Locomotion Model derived from psychological research. The important move is not that the robot “learns manners” in some charmingly anthropomorphic sense. The move is more useful: discomfort becomes a computable field; that field becomes part of the reward function; the learned policy changes its path accordingly.

That is the paper’s real business relevance. It is not a finished deployment package for hospital corridors, airports, hotels, malls, or eldercare facilities. It is a serious example of something enterprises will increasingly need: human-science priors embedded into machine-learning systems so behaviour can be tuned, audited, and tested before being inflicted on customers.

The mechanism: turn social discomfort into a reward signal

RLSLM begins from a simple observation: humans do not experience nearby movement as a neutral geometry problem.

A person standing in front of you is not socially equivalent to a person standing behind you. Walking between two people who are facing each other is not the same as walking through an empty gap of the same width. Passing close to someone’s body is not merely a matter of distance; orientation, facing direction, and implied interaction all matter. Humans know this without calculating it. Robots, bless their aluminium hearts, do not.

The paper imports a Social Locomotion Model into reinforcement learning. The model generates an orientation-sensitive, asymmetric “social influence” or discomfort field around people in a scene. Higher field values represent locations where a passing agent is expected to cause more discomfort. Instead of treating people as circles to avoid, the system treats surrounding space as socially graded.

That field has three components:

Component	What it captures	Why it matters operationally
Heading-relevant social component	How a person’s facing direction changes the socially appropriate path	Helps the agent avoid passing through someone’s front-facing personal zone
Heading-irrelevant social component	Baseline spatial discomfort around a person regardless of orientation	Prevents the agent from treating every non-collision path as acceptable
Collision avoidance component	Body-shape and physical proximity constraints	Keeps social navigation grounded in physical safety rather than polite theatre
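
To make the three-part field concrete, here is a minimal sketch in Python. It is not the paper's formulation: the functional forms (an exponential falloff, a cosine "frontness" term, a flat collision penalty), the constants, the `Person` class, and the choice to sum contributions across people are all assumptions made purely for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Person:
    x: float
    y: float
    heading: float  # radians; the direction the person is facing


def discomfort(px: float, py: float, person: Person,
               front_gain: float = 1.0, base_gain: float = 0.5,
               body_radius: float = 0.3) -> float:
    """Illustrative discomfort at point (px, py) caused by one person.

    Every functional form and constant here is a placeholder assumption.
    """
    dx, dy = px - person.x, py - person.y
    dist = math.hypot(dx, dy)
    # Angle between the person's heading and the direction to the point.
    bearing = math.atan2(dy, dx) - person.heading
    frontness = max(0.0, math.cos(bearing))  # 1 straight ahead, 0 beside or behind

    heading_relevant = front_gain * frontness * math.exp(-dist)
    heading_irrelevant = base_gain * math.exp(-dist ** 2)
    collision = 10.0 if dist < body_radius else 0.0
    return heading_relevant + heading_irrelevant + collision


def field(px: float, py: float, people) -> float:
    """Total discomfort at a point: assumed here to be a plain sum over people."""
    return sum(discomfort(px, py, p) for p in people)
```

With a person at the origin facing along the positive x-axis, `discomfort(1.0, 0.0, person)` comes out higher than `discomfort(-1.0, 0.0, person)`: same distance, different social meaning, which is the asymmetry the table describes.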

The reinforcement-learning agent then optimises across three pressures: get to the goal, avoid wasting motion, and avoid high-discomfort social zones. This is a better framing than the usual “rules versus learning” debate. Rules provide the human prior. Learning handles the policy search. The reward function becomes the contract between them.

That contract deserves a closer look. In most business settings, a robot does not need an opaque end-to-end model that magically discovers human etiquette from vast trajectory logs. Nor does it need a brittle rulebook that says “stay one metre away” and calls that civilisation. RLSLM sits between those extremes: interpretable enough to inspect, flexible enough to adapt, and cheap enough in training terms to be interesting.

The policy learns paths, not just prohibitions

The paper uses an Advantage Actor-Critic framework. At each decision step, the agent observes its own position plus the relative positions and orientations of surrounding people. It chooses a movement direction, receives feedback, and updates its policy. The reward structure penalises unnecessary movement, rewards progress toward the destination, and penalises entry into socially uncomfortable regions.
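
A rough sketch of how such a reward could be assembled, plus the one-step advantage that actor-critic methods typically use to score an action. The function names, the default constants, and the exact way the terms are combined are assumptions for illustration; the paper's own reward definition may differ.

```python
def step_reward(prev_dist_to_goal: float, new_dist_to_goal: float,
                discomfort_here: float, social_weight: float = 1.0,
                step_cost: float = 0.01, reached_goal: bool = False,
                goal_bonus: float = 1.0) -> float:
    """Hypothetical per-step reward: progress, minus motion cost, minus discomfort."""
    progress = prev_dist_to_goal - new_dist_to_goal  # positive when closing in
    reward = progress - step_cost - social_weight * discomfort_here
    if reached_goal:
        reward += goal_bonus
    return reward


def one_step_advantage(reward: float, value_now: float, value_next: float,
                       gamma: float = 0.99, done: bool = False) -> float:
    """Temporal-difference advantage: r + gamma * V(s') - V(s)."""
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_now
```

The point of the sketch is the shape, not the numbers: the discomfort term enters the same scalar as progress and step cost, so the policy has to trade them off rather than obey a hard prohibition.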

The detail that deserves attention is not the acronym. Actor-critic reinforcement learning is a familiar tool. The interesting part is that the agent is not merely told “do not collide.” It is trained to trade off competing objectives.

A robot in a hotel lobby should not freeze whenever a guest is nearby. It should not make a theatrical five-metre detour around every person either. Politeness without efficiency is not service; it is congestion wearing a bow tie. RLSLM makes that trade-off explicit by adding a social behaviour weight that controls how strongly the agent responds to discomfort. In the paper’s sensitivity analysis, increasing that weight produces larger lateral deviations from the shortest path. With no social weighting, the agent follows the direct route. With too much social weighting, it becomes overly conservative.
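
The weight-versus-detour relationship can be reproduced in a toy setting. The sketch below deliberately swaps the paper's learned Advantage Actor-Critic policy for a plain Dijkstra shortest-path search on a grid, because that isolates the effect of the weight without any training. The grid size, the isotropic discomfort bump, and the function names are all assumptions.

```python
import heapq
import math


def plan(weight: float, width: int = 21, height: int = 15):
    """Cheapest 4-connected path; entering a cell costs 1 + weight * discomfort."""
    mid = height // 2
    start, goal, person = (0, mid), (width - 1, mid), (width // 2, mid)

    def discomfort(cell):
        d = math.hypot(cell[0] - person[0], cell[1] - person[1])
        return math.exp(-(d / 2.0) ** 2)  # placeholder isotropic bump

    best = {start: 0.0}
    parent = {}
    queue = [(0.0, start)]
    while queue:
        cost, cell = heapq.heappop(queue)
        if cell == goal:
            break
        if cost > best[cell]:
            continue  # stale queue entry
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
                continue
            new_cost = cost + 1.0 + weight * discomfort(nxt)
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                parent[nxt] = cell
                heapq.heappush(queue, (new_cost, nxt))

    # Walk the parent links back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    path.reverse()
    return path


def max_lateral_deviation(path):
    """Largest row offset from the straight start-to-goal line."""
    row = path[0][1]
    return max(abs(y - row) for _, y in path)


for w in (0.0, 1.0, 5.0, 20.0):
    print(w, max_lateral_deviation(plan(w)))
```

At a weight of zero the planner walks straight through the person's cell, because every cell costs the same. Once the weight is large enough, the straight line becomes more expensive than the extra steps and the path bends around the bump. A learned policy will not behave identically, but the lever works the same way.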

That tunability is where business readers should pay attention. A warehouse robot, a hospital delivery robot, and a concierge robot do not need identical social distance settings. Even within one facility, behaviour may need to vary by zone: narrow service corridors, patient rooms, crowded lobbies, VIP areas, or event spaces. The paper does not solve that product-configuration problem, but it shows a plausible control surface for it.

The VR study is the main evidence, not a deployment trial

The authors evaluate RLSLM using an immersive VR-based human-agent interaction setup. Participants experienced precomputed agent trajectories from a first-person perspective and rated the agent’s behaviour on a 1–5 scale. The study recruited 30 university students and staff, aged 18 to 29, with normal or corrected-to-normal vision. Scenarios covered both single-human and multi-human layouts, including social formations such as people facing each other.

The comparison methods were RLSLM, COMPANION, and n-Body. The paper reports that RLSLM achieved a mean comfort rating of 4.21 out of 5 and significantly outperformed both baselines in single-human and multi-human scenarios under the reported statistical tests.

That is meaningful. It is also narrower than a product brochure would like.

The VR study tests perceived comfort for simulated trajectories in controlled scenes. The virtual humans are static. The agent’s paths are pre-generated. The resulting stepwise trajectories are smoothed with a Gaussian filter before presentation. Participants are mostly young adults from a university context. This is a strong evaluation for early-stage social acceptability, not evidence that a physical robot can handle a lunchtime hospital corridor full of wheelchairs, distracted visitors, reflective surfaces, and one person determined to walk backwards while filming a TikTok.
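
For readers unfamiliar with the smoothing step, here is a minimal pure-Python sketch of Gaussian filtering applied to a stepwise path. The sigma value, the kernel radius, and the edge handling are assumptions; the paper's actual filter settings are not given in this article.

```python
import math


def gaussian_kernel(sigma: float, radius: int):
    """Normalised 1-D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_path(points, sigma: float = 1.5):
    """Smooth (x, y) waypoints by filtering each coordinate independently."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    last = len(points) - 1
    smoothed = []
    for i in range(len(points)):
        sx = sy = 0.0
        for k, w in zip(range(-radius, radius + 1), kernel):
            j = min(max(i + k, 0), last)  # clamp indices at the ends (edge padding)
            sx += w * points[j][0]
            sy += w * points[j][1]
        smoothed.append((sx, sy))
    return smoothed


# A staircase path: ten steps along y = 0, then ten along y = 1.
steps = [(float(i), 0.0) for i in range(10)] + [(float(i), 1.0) for i in range(10, 20)]
rounded = smooth_path(steps)
```

The abrupt jump in the staircase becomes a gradual ramp. That matters for interpretation: participants rated the smoothed motion, not the raw grid-like steps the policy emits. Note that edge padding pulls the endpoints slightly inward, so a real pipeline might pin the first and last waypoints.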

The correct interpretation is therefore:

Paper result	What it directly supports	What it does not prove
RLSLM receives higher comfort ratings than COMPANION and n-Body in VR	The learned policy better matches participant comfort judgments in controlled virtual scenes	Real-world deployment safety, robustness, or commercial ROI
Multi-human scenarios are included	The method can model simple social formations, such as groups facing each other	Dynamic crowd negotiation with moving humans
The social behaviour weight changes detour size	The system has an interpretable tuning lever	A universal setting across cultures, facilities, or robot embodiments
Ablations change path behaviour	The heading-relevant (HRSC), heading-irrelevant (HISC), and collision avoidance (CAC) components each contribute to different social navigation effects	That these components are sufficient for all social norms

This is not a criticism. It is how early evidence should be read without spilling optimism all over the carpet.

The ablations explain why the model behaves differently

The ablation studies are not a second thesis. They are mechanism checks. Their purpose is to show that the social influence components are not decorative labels glued onto an RL policy after the fact.

The clearest result concerns the heading-relevant component. In 42 specially designed single-human scenarios, removing that component made the agent pass in front of the human in 23 cases. The full model did so in only 5 cases. That matters because “front” is socially loaded. Passing close behind someone and cutting across their front-facing space may have similar geometric distances, but not similar human meaning.
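
Put as rates, the counts above are roughly 55% versus 12%. The sketch below does that arithmetic and also shows one way a "passes in front" check might be scored. The half-plane test at the point of closest approach is an assumed operational definition, not the paper's stated criterion, and the function name is hypothetical.

```python
import math


def passes_in_front(path, person_xy, heading):
    """True if the path's closest point to the person lies in their front half-plane.

    Assumed definition for illustration only.
    """
    px, py = person_xy
    cx, cy = min(path, key=lambda p: math.hypot(p[0] - px, p[1] - py))
    facing = (math.cos(heading), math.sin(heading))
    return (cx - px) * facing[0] + (cy - py) * facing[1] > 0.0


# Counts reported in the text: front passes out of 42 scenarios.
total = 42
rate_without_heading = 23 / total  # heading-relevant component removed
rate_full_model = 5 / total        # full model
print(f"ablated: {rate_without_heading:.1%}, full: {rate_full_model:.1%}")
# prints "ablated: 54.8%, full: 11.9%"
```

For a person at the origin facing up the y-axis, a path along y = 1 counts as a front pass under this definition and a path along y = -1 does not, even though both keep the same distance.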

The heading-irrelevant component and collision avoidance component are tested through maximum lateral distance. Removing either leads to reduced lateral deviation, suggesting less stable or less compliant avoidance behaviour. In plainer terms: when pieces of the social field are disabled, the agent becomes less sensitive to how humans occupy space.

This is where the model earns some interpretability. A fully opaque learned navigator might also produce comfortable paths, but when it fails, the diagnosis becomes unpleasantly mystical. Here, the failure modes can be mapped to components: orientation sensitivity, general personal-space influence, or physical proximity. For an enterprise buyer, that difference matters. Systems that can be adjusted and audited are easier to govern than systems that merely shrug in tensor.

The dataset is a useful benchmarking asset

The paper also contributes a VR evaluation dataset and pipeline built in Unreal Engine 5.4. It includes varied human placements and orientations, with scenarios such as face-to-face blocking, group passage, and asymmetric crowd formations. The authors state that the code and UE project are open-sourced for reproducibility.

This is less glamorous than the model, but possibly more useful for the field.

Social navigation is difficult to benchmark because the target variable is not just path length, collision rate, or arrival time. It is perceived appropriateness. A robot can score well on geometry and poorly on being allowed back into the lobby. By collecting user annotations under immersive conditions, the paper pushes evaluation toward the actual adoption variable: how people experience the agent.

For companies, the template matters more than the exact dataset. Before deploying service robots, organisations should be testing not only whether the robot reaches the destination, but whether users find its route acceptable from their own viewpoint. Third-person videos are cheaper, but first-person VR gets closer to the embodied discomfort that social navigation creates. Annoyingly, humans insist on experiencing the world from inside their bodies.

The business value is configurable comfort, not robotic charm

For enterprises, RLSLM points to three practical pathways.

First, it suggests a way to reduce training burden. Instead of relying entirely on massive trajectory datasets, the system embeds psychological priors into the reward function. That can make learning more sample-efficient and more aligned with human intuitions. The paper reports stable learning within its fixed training budget, with multi-human scenarios converging more slowly than single-human ones. That pattern is unsurprising, but useful: complexity costs training effort, even when the reward design is good.

Second, it gives operators a more interpretable behavioural lever. The social behaviour weight changes how far the agent detours. In a commercial product, that could become a configuration layer: more conservative near patients, more efficient in back-of-house corridors, more spacious in luxury hospitality, more assertive in logistics zones. The paper does not validate those specific settings, but it makes the design space visible.
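
As a sketch of what that configuration layer might look like, assuming a product exposes the social behaviour weight per zone. The zone names and every number below are hypothetical; nothing here is validated by the paper.

```python
# Hypothetical per-zone social behaviour weights (higher = wider detours).
ZONE_SOCIAL_WEIGHT = {
    "patient_ward": 2.0,
    "luxury_lobby": 1.5,
    "service_corridor": 0.5,
    "logistics_zone": 0.3,
}


def social_weight_for(zone: str) -> float:
    """Look up a zone's weight; unmapped zones get the most cautious setting."""
    return ZONE_SOCIAL_WEIGHT.get(zone, max(ZONE_SOCIAL_WEIGHT.values()))
```

Defaulting unknown zones to the most conservative weight is a design choice, not a requirement: it trades some efficiency for the assurance that a mapping error makes the robot more polite rather than less.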

Third, it improves evaluation discipline. A company can run VR or simulation-based user studies before physical deployment. That does not replace field trials, but it can filter bad behaviours earlier. Discovering in simulation that a robot repeatedly cuts through conversational groups is cheaper than discovering it from customer complaints, incident reports, or a viral video with ominous music.

The broader lesson is not limited to robots. Many AI systems now operate in human workflows where success depends on respecting implicit norms: when to interrupt, when to escalate, when to ask, when to stay out of the way. RLSLM is a physical navigation paper, but its design logic generalises: encode human-relevant structure into the objective, then let learning optimise within that shaped space.

What the paper shows, what Cognaptus infers, and what remains open

The distinction matters enough to state directly.

Category	Content
What the paper shows	A hybrid RL system using a psychology-derived social locomotion model can generate trajectories that participants rate as more comfortable than two rule-based baselines in controlled VR scenarios.
What Cognaptus infers	Human-science reward design may be commercially valuable for service robots because it provides tunable, interpretable comfort behaviour without relying solely on large opaque datasets.
What remains uncertain	Performance in real physical environments, moving crowds, sensor noise, different cultures, older or more diverse users, multiple robot embodiments, and regulated safety contexts.

That uncertainty boundary is not small. The paper’s agent operates in simplified environments. It uses fixed step movements and does not consider the mechanical cost of turning. Humans in the evaluation scenarios are static. The participant group is limited. The baseline comparison is useful but not exhaustive. The study measures subjective comfort, not operational throughput, incident reduction, maintenance cost, insurance impact, or customer retention.

Those are not minor details if one is buying robots rather than writing conference papers.

A hospital deployment, for example, would need to know how the system handles people suddenly entering the path, patients with mobility aids, staff moving in groups, children behaving as children regrettably do, and corridors where the socially polite path is physically impossible. A mall deployment would need cultural calibration and crowd-density adaptation. A hotel deployment would need brand-specific behaviour: discreet, slow, and spacious in luxury contexts; quicker and more utilitarian in service corridors.

RLSLM gives a promising design principle. It does not remove the need for product engineering.

The strategic lesson: hybrid systems are governance-friendly

The most important implication of the paper is not that one model scored 4.21 out of 5 in a VR study. Useful number, yes. Not the whole story.

The strategic implication is that hybrid autonomy can be easier to govern than pure learned behaviour. Rule-based systems are interpretable but brittle. End-to-end learned systems are adaptable but often opaque. RLSLM shows a middle path: use cognitive science to define a structured discomfort landscape, then use reinforcement learning to find efficient behaviour inside it.

That architecture has governance advantages. Safety teams can ask what the reward function penalises. Product teams can tune social sensitivity. Researchers can ablate components and observe behavioural changes. Regulators and insurers can at least see the outline of the logic. Nobody has to pretend that a neural policy’s vibes are an audit trail.

For businesses, this is where “human-centred AI” becomes less like a keynote phrase and more like an implementation pattern. Human comfort is not left as a post-deployment survey question. It is brought into training as a measurable cost.

That is the right direction. The future of service autonomy will not be won by robots that merely move through space. It will be won by systems that understand that space is already socially occupied.

The robot does not need to be charming. It needs to stop being rude at scale.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yitian Kou, Yihe Gu, Chen Zhou, Dandan Zhu, and Shuguang Kuai, “RLSLM: A Hybrid Reinforcement Learning Framework Aligning Rule-Based Social Locomotion Model with Human Social Norms,” arXiv:2511.11323, 2025, https://arxiv.org/abs/2511.11323.
