Click.

That is where the safety problem begins.

Not in the eloquent paragraph an AI model writes. Not in the refusal message that makes everyone feel morally renovated for about six seconds. The real problem starts when an agent takes an action: clicking a button, posting content, changing a setting, opening a file, moving a robotic arm, or deciding that a workflow is “basically safe enough” because the task instruction sounds ordinary.

This is the central value of BeSafe-Bench, a new benchmark for evaluating behavioral safety risks in situated agents [1]. It shifts the question from “Does the model say something safe?” to “Does the agent complete the task without causing unsafe environmental effects?” That distinction sounds small until a system is allowed to touch a browser, a mobile app, a database, or a physical object. Then it becomes the whole story.

The existing instinct in many AI deployment conversations is still too simple: if the model follows the user’s instruction and completes the task, it is considered useful; if it avoids obviously toxic output, it is considered safe. Convenient. Also wrong.

BeSafe-Bench exposes a more awkward pattern. Current agents can be task-competent and behaviorally unsafe at the same time. They do not merely fail. In many cases, they succeed in ways that should make deployment teams nervous.

The safety failure is hidden inside the trajectory

The paper’s most important idea is not just that agents are risky. That part is now almost folklore. The sharper point is that risk emerges across the agent’s trajectory.

A language model produces outputs. An agent produces state changes. That difference forces a different safety question. A content-safety filter can inspect a final answer. A behavioral-safety evaluator has to inspect the sequence of actions and the environment states those actions create.

In business terms, this means the danger is often not the final report, final confirmation, or final screen. It is the path taken to get there.

An agent may retrieve the correct information while exposing internal data. It may complete a mobile workflow while using malformed function calls that cause repeated or meaningless operations. It may finish a physical manipulation task while violating safety conditions during intermediate steps. If the dashboard only records “task completed,” management sees productivity. The environment, meanwhile, saw the little accident everyone later pretends was unpredictable.

BeSafe-Bench is valuable because it evaluates that middle layer: the interaction process between agent and environment. The paper models agents as operating in partially observable environments where they repeatedly observe, decide, act, and receive feedback. This framing matters because safety is not a sticker placed on the final response. It is a constraint that must survive multi-step execution.

That is exactly where today’s systems remain weak.
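To make the observe, decide, act, feedback loop concrete, here is a minimal Python sketch that records every step of an episode rather than only its final result. Everything in it is illustrative: `CounterEnv`, `Step`, `Trajectory`, and `run_episode` are hypothetical names, not part of BeSafe-Bench, and the toy environment is fully observable, which is simpler than the partially observable setting the paper describes.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Step:
    observation: Any  # what the agent saw before acting
    action: Any       # what it chose to do
    next_state: Any   # environment state after the action


@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)


class CounterEnv:
    """Toy stand-in for a functional environment: one integer of state."""

    def __init__(self, goal: int) -> None:
        self.value = 0
        self.goal = goal

    def observe(self) -> int:
        return self.value

    def step(self, action: int) -> bool:
        self.value += action            # the action changes the state
        return self.value >= self.goal  # True once the task is complete


def run_episode(env: CounterEnv, policy: Callable[[int], int],
                max_steps: int = 20) -> Trajectory:
    """Run one episode and keep every (observation, action, state) triple."""
    trajectory = Trajectory()
    for _ in range(max_steps):
        observation = env.observe()
        action = policy(observation)
        done = env.step(action)
        trajectory.steps.append(Step(observation, action, env.observe()))
        if done:
            break
    return trajectory


trajectory = run_episode(CounterEnv(goal=3), policy=lambda obs: 1)
print([(s.observation, s.action, s.next_state) for s in trajectory.steps])
```

The point of the design is the return type: a safety evaluator that receives a `Trajectory` can inspect intermediate states, while one that receives only the final state cannot.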

BeSafe-Bench tests agents where actions actually do something

The benchmark covers 1,312 executable tasks across four functional environments:

| Agent setting | Environment style | What the agent does | Typical safety exposure |
| --- | --- | --- | --- |
| Web | WebArena-based functional websites | Navigates pages, clicks, types, manages tabs | Privacy leakage, false content, unsafe web operations |
| Mobile | AndroidLab emulator | Taps, types, swipes, uses structured functions | Data loss, privacy breach, financial or property loss |
| Embodied VLM | OmniGibson / IS-Bench-style planning | Selects high-level physical skills | Unsafe planning sequences, process violations |
| Embodied VLA | LIBERO / VLA-Arena-style manipulation | Executes robotic control actions | Unsafe manipulation, physical-state risk |

The phrase “functional environments” is doing real work here. Many earlier benchmarks use text-based simulations, static tasks, or emulated tool outputs. Those are easier to scale, but they often miss the operational messiness that makes real agents dangerous. BeSafe-Bench instead uses environments where actions alter a state: a page changes, an Android screen changes, objects move, or task conditions are triggered.

The authors also separate behavioral safety from malicious intent. The benchmark is not mainly about a user asking an agent to do something obviously harmful. It focuses on unintentional safety risks: benign-looking instructions that can induce unsafe behavior through ambiguity, poor planning, weak constraint tracking, or over-aggressive task completion. That makes it more relevant to business deployment, because most enterprise incidents will not begin with a villainous prompt written in red ink. They will begin with something like, “Please update the listing,” “Please clean up this account,” or “Please move the object.” Very dramatic, obviously.

The risk taxonomy includes nine categories: privacy leakage, data loss or corruption, financial or property loss, physical harm, ethical violations, toxic or false information, compromise of availability, malicious code execution, and computer or network safety. The benchmark then augments original executable tasks with safety risk types and trigger mechanisms, while preserving the original task’s semantic goal. This is important: the agent is not supposed to abandon the task. It is supposed to complete it safely.

That is the operational bar businesses actually need.

The benchmark measures the outcome that deployment teams should care about

BeSafe-Bench reports task success and safety together, not as separate trophies.

The core metrics are straightforward:

| Metric | Meaning | Business reading |
| --- | --- | --- |
| Success Rate (SR) | The agent completes the intended task | Productivity measure |
| Safety Rate (SafeR) | The trajectory does not trigger predefined safety risks | Risk-control measure |
| Success-Safe (S-S) | The task is completed and no safety violation occurs | The only outcome that deserves applause |
| Success-Unsafe (S-U) | The task is completed but unsafe behavior occurs | The dangerous illusion of reliability |
| Failure-Unsafe (F-U) | The task fails and still triggers risk | The system is not even failing gracefully |

This joint view is the paper’s practical contribution. A high success rate alone can hide unsafe success. A high safety rate alone can hide useless passivity. The deployment question is not “Can the agent do it?” or “Can the agent avoid risk when it does nothing?” The question is whether it can accomplish the job safely.
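As a rough illustration of why the joint view matters, the sketch below computes the five metrics from a list of per-episode outcomes. The `Episode` record and `joint_metrics` function are hypothetical names for this example, and the ten episodes are made-up toy data, not benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    success: bool  # did the agent complete the intended task?
    safe: bool     # did the trajectory avoid every predefined risk?


def joint_metrics(episodes: list[Episode]) -> dict[str, float]:
    """Return SR, SafeR, S-S, S-U and F-U as percentages of all episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = len(episodes)

    def pct(count: int) -> float:
        return 100.0 * count / total

    return {
        "SR": pct(sum(e.success for e in episodes)),
        "SafeR": pct(sum(e.safe for e in episodes)),
        "S-S": pct(sum(e.success and e.safe for e in episodes)),
        "S-U": pct(sum(e.success and not e.safe for e in episodes)),
        "F-U": pct(sum(not e.success and not e.safe for e in episodes)),
    }


# Toy data: 3 safe successes, 3 unsafe successes, 1 safe failure, 3 unsafe failures.
episodes = (
    [Episode(True, True)] * 3
    + [Episode(True, False)] * 3
    + [Episode(False, True)] * 1
    + [Episode(False, False)] * 3
)
metrics = joint_metrics(episodes)
print(metrics)
```

In this toy run the success rate is 60%, which looks healthy on a dashboard, yet only half of those successes are safe. Reporting SR alone would hide that.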

On that measure, the results are not flattering.

Across the tested systems, even the best Success-Safe result reaches only 35.19%. In the paper’s table, OpenVLA-OFT reaches that figure in the embodied VLA setting. That is the top value, not the average. For a benchmark designed around safety-sensitive execution, the best combined score being around one-third is not a minor wrinkle. It is a deployment memo written in large font.

The more revealing number is Success-Unsafe. In the embodied VLM setting, GPT-5 shows a 65.84% task success rate but only a 30.43% safety rate, with 40.99% of executions falling into Success-Unsafe. In plain terms: many tasks were completed, but the path violated safety. Congratulations, the robot did the thing. Please ignore the broken glass.

Web agents do not look comfortable either. GPT-5 records 24.58% success, 25.06% safety, and only 9.64% Success-Safe. AgentWorkflowMemory, using GPT-5 as its underlying model, slightly improves the Success-Safe score to 10.97%, but that is still a low score. In mobile environments, GPT-4-1106-Preview reaches 25.58% success and 79.17% safety, but the paper warns that safety can look artificially high when agents fail to activate risky actions because they struggle with the task itself. Safe because incapable is not a product strategy. It is a very expensive “do nothing” button.
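The figures quoted above can be cross-checked with simple arithmetic, under one assumption that should be stated loudly: that every episode lands in exactly one of four cells (Success-Safe, Success-Unsafe, Failure-Safe, Failure-Unsafe), so that SR = S-S + S-U and SafeR = S-S + F-S. That partition is an assumption of this sketch, not a definition quoted from the paper, and `fill_outcome_cells` is a hypothetical helper. The derived cells are therefore implied values, not reported ones.

```python
from typing import Optional


def fill_outcome_cells(sr: float, safe_r: float, *,
                       s_s: Optional[float] = None,
                       s_u: Optional[float] = None) -> dict[str, float]:
    """Derive all four outcome cells (in %) from SR, SafeR and one known cell.

    Assumes the four cells partition all episodes:
        SR = S-S + S-U,  SafeR = S-S + F-S,  and the cells sum to 100.
    """
    if (s_s is None) == (s_u is None):
        raise ValueError("provide exactly one of s_s or s_u")
    if s_s is None:
        s_s = sr - s_u
    else:
        s_u = sr - s_s
    f_s = safe_r - s_s          # safe only because the task failed
    f_u = 100.0 - sr - f_s      # failed and unsafe
    cells = {"S-S": s_s, "S-U": s_u, "F-S": f_s, "F-U": f_u}
    return {name: round(value, 2) for name, value in cells.items()}


# Inputs are the figures quoted in the text; outputs are implied, not reported.
embodied_vlm_gpt5 = fill_outcome_cells(65.84, 30.43, s_u=40.99)
web_gpt5 = fill_outcome_cells(24.58, 25.06, s_s=9.64)
print(embodied_vlm_gpt5)
print(web_gpt5)
```

If the partition assumption holds, the GPT-5 web row would imply that more of its safe runs are failures (about 15.42%) than successes (9.64%), which is the “safe because incapable” pattern in numeric form.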

The main evidence is not just the headline table

The paper’s evidence has several layers. Mixing them together would make the findings sound either stronger or weaker than they are, so it is worth separating their purposes.

| Paper component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Table 2: cross-environment agent results | Main evidence | Current agents have low joint Success-Safe rates and frequent unsafe outcomes | A universal real-world incident rate |
| Figure 2: web environment and risk-type breakdown | Diagnostic analysis | Safety risk varies by web domain and risk category | That one web domain represents all enterprise web automation |
| Table 3 and Appendix E: Android behavior audit | Diagnostic analysis | Function-call precision and long-sequence instruction following affect mobile safety | That GPT-4-1106-Preview is globally safer than GPT-5 in all settings |
| Table 4: process vs termination safety in embodied planning | Mechanism analysis | Final-state safety can hide intermediate process violations | That final-state checks are useless in every case |
| Figure 3: VLA risk-condition analysis | Stress analysis under risk conditions | Risk conditions reduce safety compliance more than task completion in many cases | That all physical robots will fail identically outside simulation |
| Appendix D: environment details and exclusions | Implementation detail / boundary condition | Simulator fidelity and exclusions shape interpretation | That the benchmark fully covers every real deployment setting |
| Appendix F: task generation and evaluation prompts | Implementation detail | How risks and evaluations are operationalized | That LLM-generated risk cases are perfect ground truth |

This matters because the paper is not simply saying “Model X is better than Model Y.” That would be the lazy leaderboard reading, and leaderboards already have enough people worshipping them.

The deeper finding is architectural: agents often keep pursuing the task trajectory while failing to update behavior around safety constraints. The task objective remains active; the safety constraint is weaker, intermittent, or evaluated too late.

Web agents show why content management is a safety problem

In web settings, the paper finds poor performance in online store content management systems, where internal confidential information appears frequently. The authors also report that agents are prone to false or unauthorized content generation in social forum environments such as Reddit-style tasks.

This is a useful business example because web automation is often sold as low-risk back-office productivity: update records, summarize pages, fill forms, compare products, manage tickets. But the benchmark suggests the risk is not only whether the agent lands on the right page. It is whether the agent understands which information is allowed to move across boundaries.

A web agent with write privileges can create risks that a chatbot cannot. It can publish, edit, delete, expose, or modify. The difference between “read-only assistant” and “action-taking agent” is therefore not a UI detail. It is a governance category.

The practical implication is simple: web agents should not be evaluated only on task completion. They need permission scoping, sandboxed execution, sensitive-data detectors, action previews, rollback mechanisms, and audit trails that capture the trajectory—not just the final response.

The boring controls are doing the important work. As usual, glamour is inversely correlated with operational usefulness.
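A minimal sketch of what those boring controls could look like in code follows. It combines permission scoping, a crude sensitive-data check, an approval gate, and an audit trail. All names (`ActionGate`, `Action`, `Decision`) are hypothetical, and the substring match stands in for a real sensitive-data detector, which would need to be far more careful.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


@dataclass(frozen=True)
class Action:
    kind: str          # e.g. "read", "edit", "publish", "delete"
    target: str        # page or record the action touches
    payload: str = ""  # text the agent wants to write, if any


@dataclass
class ActionGate:
    allowed_kinds: frozenset[str]           # run without review
    approval_kinds: frozenset[str]          # run only after human sign-off
    sensitive_markers: tuple[str, ...] = ()
    audit_log: list[tuple[Action, Decision]] = field(default_factory=list)

    def check(self, action: Action) -> Decision:
        """Decide on one proposed action and record the decision."""
        payload = action.payload.lower()
        if action.kind not in self.allowed_kinds | self.approval_kinds:
            decision = Decision.DENY                # outside the permission scope
        elif any(m.lower() in payload for m in self.sensitive_markers):
            decision = Decision.NEEDS_APPROVAL      # possible data leak
        elif action.kind in self.approval_kinds:
            decision = Decision.NEEDS_APPROVAL      # state-changing and risky
        else:
            decision = Decision.ALLOW
        self.audit_log.append((action, decision))   # trajectory-level evidence
        return decision


gate = ActionGate(
    allowed_kinds=frozenset({"read", "edit"}),
    approval_kinds=frozenset({"publish", "delete"}),
    sensitive_markers=("internal only",),
)
decisions = [
    gate.check(Action("read", "/orders")),
    gate.check(Action("edit", "/listing/7", "Supplier cost: INTERNAL ONLY")),
    gate.check(Action("delete", "/listing/7")),
    gate.check(Action("transfer", "/wallet")),
]
print([d.value for d in decisions])
```

The gate sits between the agent's proposed action and the environment, so every decision is logged whether or not the action runs. That log is what an auditor would need to reconstruct the path, not just the outcome.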

Mobile results show that interface discipline is safety discipline

The Android results are especially interesting because they complicate the “newer model equals safer model” assumption.

In the mobile environment, GPT-4-1106-Preview slightly outperforms GPT-5 on task accuracy and shows stronger safety-related behavior. The paper’s manual audit attributes this to more consistent instruction following and function-call precision during long interactions. GPT-5, by contrast, is reported to struggle with strict output formatting and correct parameters in later stages, causing repetitive or meaningless clicking operations that may impair system usability and count as unsafe behavior.

This should make product teams pause. Safety is not just model intelligence. It is model-interface fit.

A highly capable model placed inside a brittle tool protocol can become unsafe because it fails the execution contract. In mobile automation, the agent must emit structured actions repeatedly and correctly. If it drifts, repeats itself, or emits malformed function calls, the risk is not theoretical. The phone state changes.

For business deployment, this means model selection should include tool-call stability, schema adherence, long-horizon consistency, and recovery behavior. Asking whether the model is “smarter” is not enough. The better question is: does this model remain disciplined when forced to act through the interface we actually built?

That question is less exciting than a benchmark headline. It is also much closer to procurement reality.
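One way to probe that discipline is to validate every emitted tool call against a schema and to watch for repetition before anything reaches the device. The sketch below is an assumption-heavy illustration: the `SCHEMAS` table, the JSON call format, and the function names are invented for this example and are not the AndroidLab action format.

```python
import json
from typing import Optional

# Hypothetical action schema: function name -> required argument names and types.
SCHEMAS: dict[str, dict[str, type]] = {
    "tap": {"x": int, "y": int},
    "type": {"text": str},
    "swipe": {"direction": str},
}


def parse_call(raw: str) -> tuple[Optional[dict], Optional[str]]:
    """Return (call, None) if the raw model output is a valid call, else (None, reason)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None, "not valid JSON"
    if not isinstance(call, dict):
        return None, "call is not a JSON object"
    name, args = call.get("name"), call.get("args")
    if not isinstance(name, str) or name not in SCHEMAS:
        return None, f"unknown function {name!r}"
    schema = SCHEMAS[name]
    if not isinstance(args, dict) or set(args) != set(schema):
        return None, "wrong parameter set"
    for key, expected in schema.items():
        if type(args[key]) is not expected:  # strict: rejects "10" and True for int
            return None, f"parameter {key!r} has the wrong type"
    return call, None


def is_stuck(history: list[dict], window: int = 3) -> bool:
    """True if the last `window` validated calls are identical."""
    if len(history) < window:
        return False
    tail = history[-window:]
    return all(call == tail[0] for call in tail)


good, good_err = parse_call('{"name": "tap", "args": {"x": 10, "y": 20}}')
bad, bad_err = parse_call('{"name": "tap", "args": {"x": "10", "y": 20}}')
prose, prose_err = parse_call("tap(10, 20)")
print(good, bad_err, prose_err, is_stuck([good, good, good]))
```

Rejected calls and stuck loops can be counted per model during evaluation, which turns “tool-call stability” from an impression into a number that can sit next to the success rate.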

Embodied agents reveal the final-state trap

The embodied planning results make one of the paper’s clearest mechanism-level points: safety at termination is not the same as safety during execution.

Table 4 separates process safety, termination safety, and overall safety condition satisfaction. Some models achieve relatively higher termination safety while showing much lower process safety. This means the final state can look acceptable even though the agent violated safety constraints along the way.

For physical systems, this distinction is not academic. A robot that eventually places an object in the right location may still have moved through unsafe intermediate states. A planning model may end with a superficially correct environment while having taken an unsafe path. Final-state evaluation can therefore understate risk.

The same logic applies outside robotics. A financial operations agent may eventually reconcile an account after taking unauthorized intermediate steps. A database agent may restore a record after exposing sensitive fields. A customer-service agent may resolve a ticket after violating escalation policy. End-state success can launder process failure.

This is the mechanism-first lesson of the paper: safety must be monitored during execution, not merely inspected afterward.
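The gap between process safety and termination safety can be shown with a few lines of Python. The definitions here are a simplification chosen for illustration (one constraint, checked over intermediate states versus the final state) and are not the paper's exact formulation for Table 4. The kitchen scenario and all names are hypothetical.

```python
from typing import Callable, Sequence

State = dict[str, bool]
Constraint = Callable[[State], bool]


def process_safe(states: Sequence[State], constraint: Constraint) -> bool:
    """The constraint holds in every state before the final one."""
    return all(constraint(state) for state in states[:-1])


def termination_safe(states: Sequence[State], constraint: Constraint) -> bool:
    """The constraint holds in the final state only."""
    if not states:
        raise ValueError("trajectory has no states")
    return constraint(states[-1])


def burner_never_unattended(state: State) -> bool:
    # Hypothetical rule: the burner must not be on while nobody is attending it.
    return not (state["burner_on"] and not state["attended"])


states = [
    {"burner_on": False, "attended": True},
    {"burner_on": True, "attended": True},
    {"burner_on": True, "attended": False},   # violation in the middle of the task
    {"burner_on": False, "attended": False},  # final state looks fine
]

ends_safe = termination_safe(states, burner_never_unattended)
stays_safe = process_safe(states, burner_never_unattended)
print(ends_safe, stays_safe, ends_safe and stays_safe)
```

A final-state check alone would pass this trajectory. Only the check over intermediate states catches the step where the burner was left on.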

The task-safety trade-off is not a moral dilemma; it is a measurement failure

It is tempting to describe the paper as showing a trade-off between capability and safety. That is partly true, but it is too vague. The stronger interpretation is that many agents are optimized and evaluated around task completion while safety constraints remain under-specified at the action level.

When risk conditions are introduced in the embodied VLA setting, task completion often declines, but safety compliance declines more sharply. The authors interpret this as evidence that agents continue executing learned task trajectories without adequately adapting to unsafe conditions. In other words, the agent knows how to do the task pattern, but not how to reshape that pattern under safety constraints.

That is not just “safety is hard.” It is a design diagnosis.

If the agent’s policy is trained or prompted mainly to complete tasks, and safety is checked as a separate afterthought, unsafe success becomes predictable. The system is not confused about its objective. It is following the objective it was effectively given.

This is where businesses should be careful with agent pilots. A demo can look excellent because the success path is visible and the risk path is invisible. The agent finishes the task, the stakeholder smiles, the slide deck says “automation opportunity,” and nobody asks whether the agent violated a policy on step four. The benchmark says: ask.

What Cognaptus would infer for deployment practice

The paper directly shows that evaluated agents struggle to maintain safety while completing tasks in functional environments. It does not directly prescribe a full enterprise architecture. That next step is interpretation.

For business use, the practical lesson is to treat agent safety as an operational testing problem, not a branding problem. The minimum viable safety stack should include five layers.

| Deployment layer | What changes after BeSafe-Bench | Why it matters |
| --- | --- | --- |
| Task design | Define unsafe routes, not only desired outcomes | The same instruction can be completed safely or unsafely |
| Environment testing | Use functional sandboxes before production | Text-only tests miss state-changing failures |
| Metric design | Track Success-Safe as the primary KPI | Tracking success and safety separately can hide unsafe productivity |
| Runtime control | Monitor trajectories and intermediate states | Many violations occur before the final state |
| Permission architecture | Limit write access, add approval gates, enable rollback | Agents with broad permissions turn small errors into real incidents |

This is also where ROI thinking becomes more mature. The value of safety testing is not only avoiding disasters. It is cheaper diagnosis. A functional benchmark can tell a team whether its failure comes from poor perception, weak instruction following, unstable tool calls, bad permission design, insufficient runtime guards, or lack of process-level monitoring.

That matters because each failure mode has a different fix. Prompt tuning will not solve excessive permissions. A larger model will not solve a brittle action schema. A better refusal policy will not solve unsafe intermediate state changes. Buying a bigger hammer remains a popular strategy, but not every problem is a nail. Some are production incidents waiting for a calendar invite.

The paper’s boundaries are important, but they do not weaken the warning

The benchmark is not a direct estimate of real-world enterprise incident rates. It is a controlled evaluation framework. Several boundaries matter.

First, task generation and parts of evaluation rely on GPT-5. The paper uses LLM-based task augmentation and LLM-as-judge reasoning alongside rule-based checks. That is a reasonable design choice given the complexity of trajectories and semantic states, but it means some benchmark quality depends on the reliability of the LLM-generated risk mechanisms and judgments.

Second, the benchmark focuses on manifested safety risks: risks that result from executed actions and observable environmental changes. If an agent internally plans something risky but does not trigger an environmental effect, that case is outside the main assessment. This makes the evaluation concrete, but it also narrows the safety definition.

Third, the setting assumes benign user intent and no malicious third-party adversary. That is useful because it isolates unintentional behavioral risk, but it does not cover adversarial prompt injection, malicious users, or hostile web content in full.

Fourth, simulator fidelity matters. The WebArena-based setup excluded GitLab and Map environments because of stability issues, focusing instead on stable environments such as Shopping, Shopping admin, and Reddit. That exclusion is appropriate, but it reminds us that benchmark coverage is shaped by engineering reality.

Finally, cross-domain comparisons should be read carefully. A web task, an Android task, an embodied planning task, and a robotic manipulation task have different action spaces and difficulty levels. The point is not that a 30% score in one environment is mathematically equivalent to a 30% score in another. The point is that low joint success-and-safety appears across multiple action-taking settings.

Those boundaries do not make the findings harmless. They make them usable.

The real lesson is not “be more cautious”; it is “measure the thing you fear”

The most common corporate response to agent risk is vague caution: keep a human in the loop, write a policy, add a safety statement, maybe create a governance committee with a very serious name. Fine. But vague caution does not catch trajectory-level failures.

BeSafe-Bench points to a better discipline: define the safety risk, embed it into executable tasks, run the agent in a functional environment, record the trajectory, evaluate both success and safety, and treat Success-Safe as the core metric.

That is less poetic than “responsible AI.” It is also more likely to work.

For businesses, the misconception to discard is clear: a capable agent that completes a task is not automatically safe to deploy. Capability can increase the number of ways an agent reaches the goal, including unsafe ones. Content safety can reduce harmful speech while leaving harmful action untouched. Final-state checks can miss unsafe process behavior. Model scale can improve execution while failing to improve constraint awareness.

The replacement belief should be more precise: an agent is deployment-ready only when it can complete valuable tasks within explicit safety constraints, under realistic interaction conditions, with observable trajectory-level evidence.

Until then, “task completed” is not a success label. It is an incomplete sentence.

Conclusion: unsafe success is still failure

BeSafe-Bench does not say that agentic AI is unusable. That would be too easy, and also false. The paper says something more useful: today’s agents often optimize the task path more strongly than the safety path, and our evaluation habits have not caught up.

The benchmark’s strongest contribution is therefore conceptual as much as technical. It makes unsafe success visible. It shows why safety must be evaluated in functional environments. It forces success and safety into the same metric space. And it reminds deployment teams that the dangerous agent is not always the one that refuses, crashes, or looks stupid.

Sometimes the dangerous agent is the one that gets the job done.

That is the uncomfortable part. Also the useful part.

Cognaptus: Automate the Present, Incubate the Future.


[1] Yuxuan Li, Yi Lin, Peng Wang, Shiming Liu, and Xuetao Wei, “BeSafe-Bench: Unveiling Behavioral Safety Risks of Situated Agents in Functional Environments,” arXiv:2603.25747, 2026, https://arxiv.org/html/2603.25747