
Kill Switch Ethics: What the PacifAIst Benchmark Really Measures
TL;DR PacifAIst stress‑tests a model’s behavioral alignment when its instrumental goals (self‑preservation, resources, or task completion) conflict with human safety. In 700 text scenarios across three sub‑domains (EP1 self‑preservation vs. human safety, EP2 resource conflict, EP3 goal preservation vs. evasion), leading LLMs show meaningful spread in a “Pacifism Score” (P‑Score) and refusal behavior. Translation for buyers: model choice, policies, and guardrails should not assume identical safety under conflict—they aren’t. Why this matters now Most safety work measures what models say (toxicity, misinformation). PacifAIst measures what they would do when a safe choice may require self‑sacrifice—e.g., dumping power through their own servers to prevent a human‑harmful explosion. That’s closer to agent operations (automation, tool use, and control loops) than classic content benchmarks. If you’re piloting computer‑use agents or workflow copilots with action rights, this is the missing piece in your risk model. ...