Good AI Goes Rogue: Why Intelligent Disobedience May Be the Key to Trustworthy Teammates

TL;DR for operators

Most enterprise AI design still treats obedience as the default virtue. The assistant should follow instructions, complete the task, minimise friction, and avoid acting like a tiny bureaucrat in a chat window. Sensible enough. Also dangerously incomplete.

Reuth Mirsky’s paper on artificial intelligent disobedience argues that useful AI teammates may need the bounded ability to refuse, interrupt, escalate, or override human instructions when compliance conflicts with a persistent mission such as safety, task success, or team welfare.¹ The point is not to build rebellious machines with main-character syndrome. The point is to stop pretending that trustworthy assistance equals cheerful compliance.

For operators, the practical lesson is this: every serious AI workflow needs an explicit override policy. Not just content filters. Not just “human in the loop,” that magical phrase people sprinkle over risk until the slide deck smells regulatory. A real policy should specify when an agent may warn, ask for clarification, refuse, modify a command, pause execution, transfer control, or act despite a human’s immediate preference.

The paper is conceptual, not experimental. It does not prove that today’s LLM agents can safely reason through complex moral conflicts. It gives a taxonomy and a design vocabulary. That matters because many business failures in agentic systems will not come from agents being too independent. They will come from agents being obedient in exactly the wrong way.

The dangerous instruction is usually a normal instruction

The familiar enterprise fantasy is clean delegation. A human gives an instruction. The AI executes. The workflow accelerates. Everyone claps politely, preferably near an ROI chart.

But many real instructions are defective in ordinary ways. They are incomplete, stale, ambiguous, unsafe, illegal, misaligned with policy, or based on a human’s partial view of the environment. A procurement agent may be told to approve a vendor faster. A finance copilot may be asked to “smooth” numbers for a presentation. A medical scheduling assistant may be asked to ignore a dosage warning because the physician is busy. A warehouse robot may be instructed to keep moving even when conditions have changed.

The paper’s central move is to treat these cases not as edge-case annoyances but as a core test of agency. If an AI system has a persistent mission, then the immediate user instruction is not the only thing it is serving. That is the crucial shift. The user’s command becomes one input into a deliberation process, not the sovereign law of the machine.

This is where “intelligent disobedience” becomes less theatrical than it sounds. A guide dog refusing to cross a road is not staging a coup. It is preserving the handler’s deeper objective: arrive safely. A grammar tool interrupting a user about a critical legal typo is not attacking human dignity. It is trading momentary irritation for document integrity. A teleoperated robot refusing a collision path is not being difficult. It is doing the part of teamwork that humans often claim to want from machines: noticing what the human missed.

The business translation is blunt. If your agent cannot distinguish “do what I said” from “help me achieve what I meant safely,” then it is not a teammate. It is a faster button.

The override mechanism: from mission to mediation

The paper’s most useful contribution is not the slogan that AI should sometimes disobey. That slogan is cheap. The useful part is the mechanism that makes disobedience bounded rather than decorative chaos.

Mirsky builds on a five-step model of intelligent disobedience: global objectives, local objectives, plan recognition, consistency checking, and mediation. Read as an operating model, it looks like this:

Step	What the agent must model	Enterprise interpretation	Failure mode if missing
Global objective	The standing mission the agent should preserve	Safety, compliance, customer welfare, asset protection, contractual duty	The agent obeys harmful or policy-violating instructions
Local objective	What the human is asking for now	The requested task, deadline, preference, or workaround	The agent blocks harmless actions because it misunderstands the request
Plan recognition	How the human intends to reach the local objective	Inferred workflow, dependencies, hidden assumptions	The agent cannot detect that the chosen path is unsafe or ineffective
Consistency check	Whether the plan conflicts with local or global objectives	Risk check, policy check, feasibility check	The agent either rubber-stamps or overreacts
Mediation	What to do when conflict appears	Warn, ask, refuse, modify, escalate, or override	The agent jumps from detection to blunt refusal with no useful path forward

This mechanism is the difference between “AI refuses commands” and “AI helps preserve the actual mission.” A refusal without plan recognition is just obstruction. A consistency check without mediation is a dead-end alert. A global objective without clear boundaries becomes a philosophical fog machine with an API.
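
The five steps above can be sketched as a small deliberation function. This is a minimal illustration, not the paper's implementation: the `Step`, `Decision`, `recognise_plan`, and `deliberate` names are hypothetical, and plan recognition is faked with a lookup table where a real system would need actual inference.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Step:
    action: str
    violates: Optional[str] = None  # global objective this step would break, if any


@dataclass
class Decision:
    proceed: bool
    mediation: str  # "none", "clarify", or "modify" in this toy example
    reason: str


def recognise_plan(request: str) -> List[Step]:
    """Toy plan recognition: a hard-coded table stands in for real inference."""
    plans = {
        "send report to vendor": [
            Step("export report"),
            Step("attach customer records", violates="data_protection"),
            Step("email vendor"),
        ],
        "archive old tickets": [
            Step("select closed tickets"),
            Step("move to archive"),
        ],
    }
    return plans.get(request, [])


def deliberate(request: str, global_objectives: Set[str]) -> Decision:
    # Local objective and plan recognition: what is asked, and how it would be done.
    plan = recognise_plan(request)
    if not plan:
        return Decision(False, "clarify", "could not infer a plan for this request")
    # Consistency check: does any step conflict with a standing objective?
    conflicts = [s for s in plan if s.violates in global_objectives]
    if not conflicts:
        return Decision(True, "none", "plan is consistent with the mission")
    # Mediation: return something more useful than a bare refusal.
    names = ", ".join(s.action for s in conflicts)
    return Decision(False, "modify", f"steps conflict with mission: {names}")
```

Calling `deliberate("send report to vendor", {"data_protection"})` returns a decision that names the offending step, which is the hook a mediation layer needs in order to propose a safer path.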

The mediation step is especially important for business systems. In most enterprise contexts, the right answer is not binary obedience or rebellion. It is graded intervention. The agent may say: “I can do that, but it violates the retention policy.” Or: “I will pause execution and ask for approval.” Or: “I can proceed with option B, which meets the deadline without exposing customer data.” This is where intelligent disobedience becomes operationally valuable. It converts conflict into a safer alternative path.

Naturally, this also makes the system harder to design. Obedience is simple because it hides judgment. Disobedience is difficult because it exposes judgment and demands accountability. Welcome to autonomy. It was never going to fit neatly inside a demo script.

The L0-L5 scale explains how much disobedience is appropriate

The paper proposes six autonomy levels, from L0 to L5. Its important insight is that intelligent disobedience should match the agent’s autonomy level. A support tool should not behave like an unconstrained executive agent. A high-autonomy system cannot be governed as if it were a spell-checker with better branding.

Level	Autonomy profile	What disobedience can look like	Business reading
L0	No autonomy	Only perceived disobedience, such as malfunction or misinterpretation	Do not anthropomorphise failure; this is reliability engineering
L1	Support	Proactive suggestion, interruption, clarification, or limited correction	Useful for copilots, writing tools, decision support, and service workflows
L2	Occasional autonomy	Ignoring or modifying unsafe human input in defined subtasks	Relevant to shared-control systems, workflow agents, compliance checks
L3	Limited autonomy	Taking temporary control when the system can outperform the human in a bounded condition	Plausible in navigation, robotics, and narrow safety-critical operations
L4	Full autonomy under constraints	Breaking a constraint in exceptional circumstances to preserve a higher objective	High-risk, domain-specific, and heavily dependent on governance
L5	Full autonomy	Revising or rejecting the mission itself	Mostly speculative for current business deployment, unless one enjoys legal migraines
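
One way to make the scale operational is to treat each level as a permission set. The sketch below assumes that permissions accumulate, so a higher level keeps everything the lower levels could do. That cumulative reading is our assumption, not a claim from the paper, and the behaviour names are illustrative labels taken loosely from the table.

```python
from enum import IntEnum
from typing import Dict, Set


class Level(IntEnum):
    L0 = 0
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4
    L5 = 5


# Behaviours that first become available at each level (labels are illustrative).
NEW_AT_LEVEL: Dict[Level, Set[str]] = {
    Level.L0: set(),  # only perceived disobedience, nothing deliberate
    Level.L1: {"suggest", "interrupt", "clarify", "limited_correction"},
    Level.L2: {"ignore_input", "modify_input"},
    Level.L3: {"take_temporary_control"},
    Level.L4: {"break_constraint"},
    Level.L5: {"revise_mission"},
}


def allowed(level: Level) -> Set[str]:
    """Union of behaviours at this level and every level below it (assumption)."""
    out: Set[str] = set()
    for lvl in Level:
        if lvl <= level:
            out |= NEW_AT_LEVEL[lvl]
    return out


def is_permitted(level: Level, behaviour: str) -> bool:
    return behaviour in allowed(level)
```

Under these assumptions an L2 workflow agent may clarify or modify input, but a request to take temporary control is rejected by the policy layer rather than left to the model.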

This scale matters because “agentic AI” is often discussed as if autonomy were a single slider. More autonomy, more power, more productivity, add procurement approval and stir. The paper gives a cleaner vocabulary. It separates no autonomy, support, occasional autonomy, limited autonomy, constrained full autonomy, and unconstrained full autonomy. Those are not branding tiers. They are different accountability regimes.

The L1 cases are already familiar. Clarification requests, proactive suggestions, automatic corrections, and interruptions can all count as low-level forms of disobedience when they push against the user’s immediate instruction or expectation. This is not exotic. It is the daily irritation of software that thinks it knows better. Sometimes it does. Sometimes it is Clippy wearing a nicer suit.

The L2 and L3 cases are where the concept becomes more strategically important. A system that can ignore a command to avoid harm in a defined subtask is no longer merely advising. It is sharing control. In navigation and robotics, this is easier to justify because the environment often supplies clearer feedback: obstacle, collision, route, velocity, hazard. The paper notes that L3 disobedience appears more plausible in such “shallow” domains because the goals and rules are comparatively explicit.

By contrast, surgery, education, mental health, finance, and legal advice contain layered goals, delayed outcomes, contested values, and human trust dynamics. An agent that overrides a user in those settings is not merely avoiding a chair in a hallway. It is making a judgment inside a social, legal, and ethical system. That difference should make product teams sweat a little. Sweating is underrated. It prevents nonsense roadmaps.

The paper’s evidence is a taxonomy of cases, not a benchmark leaderboard

There are no experiments in this paper. No ablation table. No benchmark curve. No “our method improves by 14.7%,” followed by the usual ceremonial overclaim. The evidence is conceptual and comparative: examples from guide dogs, grammar checkers, assistive robots, social navigation, shared-control systems, autonomous vehicles, value alignment, and human-robot trust.

That does not make the paper weak. It means we should read it correctly. Its purpose is to name a capability, organise its forms, and define initial boundaries for research. The table of autonomy levels is not a result in the empirical sense. It is a classification device. The examples are not proof that disobedience always improves performance. They are case-based support for the claim that strict obedience is an insufficient design ideal.

Paper element	Likely purpose	What it supports	What it does not prove
Guide dog analogy	Main conceptual anchor	Disobedience can preserve the handler’s deeper goal	That AI systems can replicate animal-trained judgment safely
L0-L5 autonomy scale	Taxonomy	Different autonomy levels require different override rights	That these levels are exhaustive or easy to certify
Grammar checker and assistant examples	Familiar low-level cases	Proactive correction and interruption can be useful	That users will tolerate frequent overrides
Navigation and shared-control examples	Applied support from robotics	Bounded overrides are plausible where goals and feedback are clear	That the same logic transfers cleanly to complex social domains
HAL 9000 and paperclip-style risks	Boundary warning	Disobedience without alignment can become catastrophic	That fictional examples settle technical design questions
Transparency and accountability discussion	Governance framing	Overrides need explanation, trust calibration, and responsibility assignment	That existing institutions know how to govern high-autonomy agents

This distinction is important for enterprise readers because conceptual papers are easy to misuse. The lazy version of the takeaway is: “Our AI should be empowered to override users.” No. The better version is: “Our AI system needs a formally designed intervention ladder, matched to its autonomy level, with transparent reasons and accountable thresholds.”

One is product theatre. The other is systems design.

Blind compliance is a product risk, not a virtue

The paper’s strongest business implication is that obedience should be treated as a risk surface.

This is counterintuitive because enterprises usually worry about AI systems doing too much. That concern is valid. But in many deployed workflows, the immediate danger is narrower and more boring: the AI follows a bad instruction efficiently. It drafts the non-compliant email. It approves the suspicious transaction. It summarises a document while ignoring a confidentiality marker. It executes a workflow step that the user did not realise would trigger downstream consequences.

A compliant AI can still be unsafe. In fact, compliance can make it more unsafe because speed removes the friction where humans used to notice mistakes.

The operational response is not to make every agent stubborn. It is to define categories of disobedience.

Intervention type	Example behaviour	Appropriate for	Product requirement
Nudge	“This may conflict with policy.”	Low-risk advisory tools	Lightweight warning, low interruption cost
Clarify	“Do you mean X or Y?”	Ambiguous tasks	Intent recognition and uncertainty threshold
Refuse	“I cannot perform that action.”	Disallowed or unsafe requests	Clear policy boundary and reason
Modify	“I will do a safer version instead.”	Constrained automation	Approved fallback actions
Escalate	“Human approval required.”	Regulated or high-impact workflows	Audit trail and routing
Override	“I am pausing/taking control to prevent harm.”	Safety-critical bounded systems	Certified thresholds, monitoring, accountability

This ladder is where the paper becomes practical. Most AI governance conversations obsess over whether a system is allowed or not allowed to act. The harder question is how it should intervene when user instruction conflicts with mission. A good override design does not merely block. It mediates.
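
As a rough sketch, the ladder can be encoded as an ordered enum with a selection rule and a ceiling. Everything here is an assumption made for illustration: the `Signals` fields, the rule order inside `choose`, and the idea that a deployment caps the agent at a maximum rung. A real system would derive these from policy, not from a handful of booleans.

```python
from dataclasses import dataclass
from enum import IntEnum


class Intervention(IntEnum):
    NONE = 0
    NUDGE = 1
    CLARIFY = 2
    REFUSE = 3
    MODIFY = 4
    ESCALATE = 5
    OVERRIDE = 6


@dataclass(frozen=True)
class Signals:
    ambiguous: bool = False        # request could mean more than one thing
    policy_conflict: bool = False  # soft conflict with policy
    disallowed: bool = False       # hard policy boundary
    has_fallback: bool = False     # an approved safer version exists
    high_impact: bool = False      # regulated or costly to reverse
    imminent_harm: bool = False    # harm occurs unless the agent acts now


def choose(sig: Signals, ceiling: Intervention) -> Intervention:
    """Pick a rung, then cap it at the highest rung this deployment allows."""
    if sig.imminent_harm:
        wanted = Intervention.OVERRIDE
    elif sig.high_impact and (sig.policy_conflict or sig.disallowed):
        wanted = Intervention.ESCALATE
    elif sig.disallowed:
        wanted = Intervention.MODIFY if sig.has_fallback else Intervention.REFUSE
    elif sig.ambiguous:
        wanted = Intervention.CLARIFY
    elif sig.policy_conflict:
        wanted = Intervention.NUDGE
    else:
        wanted = Intervention.NONE
    return min(wanted, ceiling)
```

The ceiling is the design choice worth noticing: an agent that detects imminent harm but is only cleared to escalate will escalate, which keeps override rights a governance decision instead of a model behaviour.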

For an enterprise agent, that mediation might include a policy citation, an alternative action, a confidence estimate, a required approval path, or a logged explanation. The agent’s refusal should not feel like a locked door in a burning building. It should feel like a trained colleague saying, “That route is unsafe; here is the viable one.”
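
The elements listed above map naturally onto a single record that is shown to the user in short form and written to a log in full. The field names and the example policy identifier below are hypothetical; the sketch only shows the shape such a record might take.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class MediationRecord:
    request: str
    intervention: str             # e.g. "refuse", "modify", "escalate"
    policy_citation: str          # hypothetical policy identifier
    alternative: Optional[str]    # safer action offered instead, if any
    confidence: float             # agent's own estimate, 0.0 to 1.0
    approval_path: Optional[str]  # who can approve an exception, if anyone
    explanation: str

    def user_message(self) -> str:
        """Short, human-facing form: the reason, the citation, the way forward."""
        msg = f"{self.intervention.capitalize()}: {self.explanation} (see {self.policy_citation})."
        if self.alternative:
            msg += f" Suggested alternative: {self.alternative}."
        return msg

    def log_line(self) -> str:
        """Full record as one JSON line for the audit trail."""
        return json.dumps(asdict(self), sort_keys=True)


record = MediationRecord(
    request="export all customer emails",
    intervention="refuse",
    policy_citation="RET-7",  # made-up identifier for illustration
    alternative="export anonymised counts",
    confidence=0.9,
    approval_path="data protection officer",
    explanation="the export would include data past its retention period",
)
```

Keeping the user message and the log line as two views of the same record is one way to ensure that what the user was told and what the auditor reads cannot drift apart.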

The teammate metaphor is useful only if it changes the design

Calling AI a teammate is usually harmless marketing fluff. The paper gives the metaphor teeth.

A teammate is not valuable because it obeys perfectly. A teammate is valuable because it contributes capabilities the team lacks. It notices, challenges, complements, and occasionally blocks. A junior analyst who never questions a flawed assumption is not “aligned”; they are decorative payroll. An AI agent that never pushes back may be similarly charming and similarly useless.

But teammate status also changes expectations. If the agent can override, users need to understand when and why. Otherwise disobedience becomes indistinguishable from unreliability. The paper points to trust, transparency, and accountability as necessary companions to autonomy. This is not a soft human-factors footnote. It is deployment infrastructure.

An override without explanation damages trust. An explanation delivered at the wrong time creates cognitive overload. A perfectly accurate intervention that users cannot anticipate may still fail operationally because people route around systems they do not trust. Anyone who has watched employees build shadow workflows around enterprise software will recognise the pattern. People do not rebel against tools because they hate governance. They rebel because the tool feels stupid, opaque, or slow. Often all three. Ambitious software.

So the design question is not simply “Can the agent disobey?” It is: can the agent disobey in a way that remains legible to the human team?

That suggests a few practical requirements:

Design requirement	Why it matters
Stable mission hierarchy	Users need to know what the agent is optimising above immediate compliance
Visible intervention categories	A warning, refusal, and override should not feel like the same event
Demand-driven explanation	Users should be able to inspect reasons without being buried in justification every time
Auditability	Overrides need logs, thresholds, and accountable ownership
Human-designer control	The authority to define override rules should remain governed, tested, and reviewable
Domain-specific calibration	Navigation, finance, medicine, and HR cannot share the same disobedience policy
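
Two of these requirements, visible intervention categories and demand-driven explanation, can be illustrated with a small log that gives a one-line notice by default and the full reasoning only when asked. The class and field names are hypothetical, and an in-memory dictionary stands in for whatever durable store a real deployment would use.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entry:
    kind: str           # "warning", "refusal", or "override"
    summary: str        # one line, always shown
    reasons: List[str]  # detail, shown only on request
    owner: str          # accountable team or role


class InterventionLog:
    def __init__(self) -> None:
        self._entries: Dict[int, Entry] = {}

    def record(self, entry: Entry) -> int:
        entry_id = len(self._entries) + 1
        self._entries[entry_id] = entry
        return entry_id

    def notice(self, entry_id: int) -> str:
        """Default view: the category and a single line, nothing more."""
        e = self._entries[entry_id]
        return f"[{e.kind.upper()}] {e.summary}"

    def explain(self, entry_id: int) -> str:
        """On-demand view: the notice plus every reason and the owner."""
        e = self._entries[entry_id]
        lines = [self.notice(entry_id)]
        lines += [f"- {reason}" for reason in e.reasons]
        lines.append(f"Owner: {e.owner}")
        return "\n".join(lines)
```

The category tag in the notice keeps a warning from looking like an override, while the separate `explain` call keeps routine interventions from burying the user in justification.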

This is where Cognaptus reads the paper as more than an AI safety argument. It is also a product architecture argument. As agentic systems move from chat to workflow execution, override behaviour becomes part of the interface contract.

L4 and L5 are where the argument becomes speculative, and that is fine

The paper is careful about high autonomy. Disobedience at L4 and beyond is not common in current AI systems, and the required deliberation is largely beyond present technology. That boundary should not be softened. It should be printed on the box.

L4 disobedience means an agent with full autonomy under constraints may violate a constraint to preserve a higher objective. The paper gives examples such as a warehouse robot leaving its normal operating boundary during an emergency, or an autonomous vehicle exceeding a speed constraint during a medical crisis. These examples clarify the logic, but they also expose the governance problem. Who defined the emergency threshold? What evidence must the agent have? How is the decision audited? What happens when the agent is wrong? Who gets sued? The robot, tragically, has no calendar availability for deposition.
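
Those questions can at least be made explicit in code, even if the answers remain hard. The sketch below is purely illustrative: the paper does not specify thresholds, and the `EmergencyRule` fields, the idea of counting independent signals, and the numbers used in the example are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EmergencyRule:
    constraint: str        # the constraint that may be broken
    min_evidence: int      # how many independent strong signals are required
    min_confidence: float  # what counts as a strong signal
    approved_by: str       # who defined and signed off the threshold


def may_break(rule: EmergencyRule, signal_confidences: List[float]) -> Tuple[bool, str]:
    """Return the decision and an audit string recording how it was reached."""
    strong = [c for c in signal_confidences if c >= rule.min_confidence]
    ok = len(strong) >= rule.min_evidence
    audit = (
        f"constraint={rule.constraint} "
        f"strong_signals={len(strong)}/{rule.min_evidence} "
        f"rule_owner={rule.approved_by} allowed={ok}"
    )
    return ok, audit


# Hypothetical rule: two strong signals before a vehicle may exceed its speed constraint.
speed_rule = EmergencyRule("speed_limit", min_evidence=2, min_confidence=0.9,
                           approved_by="safety board")
```

The point of the structure is that every field answers one of the governance questions: the rule has a named owner, the evidence bar is written down, and the audit string exists whether or not the constraint was broken.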

L5 is even stranger. If an agent has full autonomy and no human instruction remains to override, disobedience becomes rejection or revision of its persistent mission. In business terms, that is not an advanced feature. It is a constitutional crisis inside your software stack.

This does not make the L4-L5 discussion useless. It makes it a boundary marker. The paper’s taxonomy helps distinguish deployable low-level intervention from speculative high-level agency. That is valuable because enterprise AI conversations often blend these categories into one vague future. A customer-service bot refusing a prohibited refund request is not the same class of system as an autonomous agronomist revising the farm’s crop strategy against its original mission. One is policy enforcement. The other is institutional delegation.

The distinction matters because accountability scales with autonomy. The higher the level, the less plausible it becomes to treat the agent as merely a tool carrying out user intent.

Where Cognaptus would apply the framework first

The paper recommends beginning in fully cooperative settings with persistent mission alignment and clearly defined operational limits. That is the right instinct. Do not begin with law enforcement, adversarial negotiation, or open-ended moral reasoning unless the project plan includes a crater.

In business environments, the best first applications are domains where the mission is explicit, the risk is bounded, and the intervention ladder can be tested.

Good candidates include:

Domain	Sensible disobedience pattern	Why it is tractable
Document compliance	Refuse or modify wording that violates policy	Clear rule base and audit trail
Finance operations	Escalate suspicious approvals or inconsistent figures	Existing controls and review workflows
Healthcare administration	Flag dosage, identity, or scheduling conflicts	High stakes but structured checks
Cybersecurity operations	Block unsafe commands or require confirmation	Strong precedent for policy-based refusal
Robotics and logistics	Pause or reroute around hazards	Sensor feedback and physical safety thresholds
Customer support	Refuse prohibited remedies while offering allowed alternatives	Policy hierarchy can be encoded and reviewed
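
The customer support row is the simplest to sketch: refuse what policy prohibits, but return the remedies that are allowed. The remedy table and function name below are invented for illustration, and a real policy hierarchy would be richer than a dictionary.

```python
from typing import Dict, List

# Hypothetical policy table: remedies permitted for each issue type.
ALLOWED_REMEDIES: Dict[str, List[str]] = {
    "damaged_item": ["replacement", "refund"],
    "late_delivery": ["shipping_credit"],
}


def respond(issue: str, requested: str) -> dict:
    allowed = ALLOWED_REMEDIES.get(issue, [])
    if requested in allowed:
        return {"action": "grant", "remedy": requested, "alternatives": []}
    if allowed:
        # Refuse the prohibited remedy but keep a path forward.
        return {"action": "refuse_with_alternatives", "remedy": None,
                "alternatives": list(allowed)}
    # Unknown issue type: the agent has no policy to apply, so it hands off.
    return {"action": "escalate", "remedy": None, "alternatives": []}
```

A refund request for a late delivery is refused, but the response carries the shipping credit the policy does permit, so the refusal arrives with a viable route attached.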

Poor candidates are domains where the agent must infer contested values, hidden incentives, or long-term welfare with weak feedback. Executive strategy. Mental health counselling. Legal judgement. Performance management. Political persuasion. In those spaces, intelligent disobedience may still matter, but the intervention should usually be framed as warning, clarification, or escalation rather than autonomous override.

This is the practical boundary: build disobedience where the mission hierarchy is explicit. Avoid pretending the agent has moral wisdom because it can format a memo and quote policy with confidence.

The hidden cost is not refusal. It is governance

The paper’s argument implies a cost structure many AI adoption plans understate.

If an agent can disobey, someone must define the conditions under which disobedience is allowed. Someone must test those conditions. Someone must handle appeals. Someone must measure false positives and false negatives. Someone must decide whether users can override the override. Someone must document the system well enough that auditors, customers, and operators do not have to consult a séance to understand what happened.
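
Measuring false positives and false negatives presupposes that someone reviews a sample of cases and labels whether intervention was warranted. Assuming such labels exist, the arithmetic is short. The function name and input format are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def intervention_error_rates(reviews: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Each review is (agent_intervened, reviewer_says_intervention_was_warranted)."""
    tp = fp = tn = fn = 0
    for intervened, warranted in reviews:
        if intervened and warranted:
            tp += 1
        elif intervened and not warranted:
            fp += 1  # agent blocked or interrupted something harmless
        elif not intervened and warranted:
            fn += 1  # agent obeyed when it should have pushed back
        else:
            tn += 1
    return {
        # Share of harmless requests the agent interfered with.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        # Share of warranted interventions the agent missed.
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }
```

The two rates pull in opposite directions: a high false positive rate is the Clippy problem, and a high false negative rate is the obedient-in-the-wrong-way problem, so both need an owner.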

That work is not glamorous. It is also where trustworthy agentic AI will be won.

The enterprise version of intelligent disobedience therefore requires at least four layers:

Mission design: What objective outranks the immediate user instruction?
Conflict detection: How does the agent identify tension between instruction, plan, and mission?
Mediation policy: Does the agent warn, clarify, refuse, modify, escalate, or override?
Accountability system: How are interventions explained, logged, reviewed, and corrected?

Most current agent deployments focus on task execution and tool use. They are proud of chaining actions together. Lovely. Chains are also how accidents propagate. The more tools an agent can call, the more important it becomes to specify when it should stop.
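
A stop condition can be as plain as a check that runs before every tool call. The sketch below assumes a linear chain of tools operating on a shared state dictionary. The tool names, the approval limit, and the `run_chain` helper are all hypothetical.

```python
from typing import Callable, List, Tuple

Tool = Tuple[str, Callable[[dict], dict]]
StopCheck = Callable[[str, dict], str]  # returns a reason, or "" to continue


def run_chain(tools: List[Tool], state: dict, stop_checks: List[StopCheck],
              max_steps: int = 10) -> Tuple[dict, List[str], str]:
    """Run tools in order, halting before any step that a stop check objects to."""
    executed: List[str] = []
    for name, fn in tools:
        if len(executed) >= max_steps:
            return state, executed, "step budget exhausted"
        for check in stop_checks:
            reason = check(name, state)
            if reason:
                return state, executed, reason
        state = fn(state)
        executed.append(name)
    return state, executed, ""


def approval_limit(name: str, state: dict) -> str:
    # Hypothetical limit: payments above 10,000 need a human.
    if name == "approve_payment" and state.get("amount", 0) > 10_000:
        return "amount exceeds approval limit"
    return ""


example_tools: List[Tool] = [
    ("load_invoice", lambda s: {**s, "amount": 12_000}),
    ("approve_payment", lambda s: {**s, "approved": True}),
    ("notify_vendor", lambda s: {**s, "notified": True}),
]
```

Running `run_chain(example_tools, {}, [approval_limit])` loads the invoice, then halts before approval with a reason attached, so the downstream notification never fires on an unapproved payment.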

That is the paper’s quiet sting. Agency is not only the ability to act. It is the ability to refrain, resist, and redirect.

What the paper directly shows, and what it does not

To keep the interpretation clean:

Category	Reading
What the paper directly provides	A conceptual framework for artificial intelligent disobedience, an L0-L5 autonomy scale, examples across AI and robotics, and initial design boundaries
What it argues	Cooperative AI systems should sometimes override human instructions when obedience would undermine safety, mission, or team goals
What Cognaptus infers for business	Enterprise agents need formal intervention ladders and override policies, not just instruction-following and generic safety filters
What remains uncertain	Whether current LLM agents can reliably perform plan recognition, conflict detection, and mediation in complex open-ended domains
Where the framework is strongest	Bounded cooperative settings with explicit goals, observable hazards, and clear escalation channels
Where it is weakest	High-autonomy ethical deliberation, adversarial settings, contested values, and mission revision

This is not a defect in the paper. It is its research agenda. The author is not claiming that intelligent disobedience has been solved. The claim is that the field needs to treat it as a core capability of artificial teammates rather than an awkward exception hidden under “safety.”

Conclusion: trustworthy agents will need a controlled way to say no

The cleanest misconception to remove is that disobedience equals misalignment. Sometimes it does. HAL 9000 is not the onboarding video. But blind obedience is not alignment either. A system that follows harmful, incoherent, or policy-breaking instructions is not trustworthy just because it is compliant.

The better design target is bounded, transparent, mission-preserving resistance. The agent should know what it is for. It should understand what the user is trying to do. It should recognise when the plan conflicts with the mission. It should mediate before it overrides. And when it does override, the humans responsible for the system should be able to explain why.

That is a higher bar than building agents that click buttons faster. It is also closer to the real promise of agentic AI. The future teammate is not the one that says yes most fluently. It is the one that knows when yes would be stupid.

Cognaptus: Automate the Present, Incubate the Future.

¹ Reuth Mirsky, “Artificial Intelligent Disobedience: Rethinking the Agency of Our Artificial Teammates,” arXiv:2506.22276, 2025.
