When Alignment Meets Reality: Why LLMs Can’t Agree With Themselves

A policy says one thing. A customer says another. A retrieved document says something newly alarming. A compliance rule says stop. A business workflow says continue.

This is where large language models become interesting, and by “interesting” I mean expensive.

Most companies still talk about LLM alignment as if it were a calibration problem. Tune the model. Add a system prompt. Insert a safety policy. Wrap it with retrieval. Then expect the assistant to behave consistently across messy real-world tasks. The paper “Are Dilemmas and Conflicts in LLM Alignment Solvable? A View from Priority Graph” argues that this expectation is too neat for the problem being solved.¹

The paper’s central claim is not that today’s models are badly trained, although that is often true enough to keep consultants fed. The sharper claim is that many alignment failures come from structural conflict. LLMs are increasingly asked to follow instructions, use external information, obey safety constraints, reflect human values, and personalize answers for different users. Those goals do not merely coexist. They collide.

The useful idea in the paper is a mechanism: a model can be viewed as operating with a context-dependent priority graph. Instructions, values, and information sources behave like nodes. The model’s choice in a given situation reveals which node it has prioritized over another. That sounds abstract, but it is a clean way to explain why the same model may refuse one request, comply with a similar request under a different framing, trust a retrieved document in one case, reject it in another, or appear morally “inconsistent” across scenarios.

The uncomfortable lesson for enterprise AI is simple: a fixed instruction hierarchy is not enough. Businesses need runtime conflict management, source-trust rules, authorization checks, verification loops, and escalation paths. In other words, the system needs to know when the model is making a priority decision—not just when it is generating text.

The mechanism is priority, not personality

When an LLM gives different answers to similar questions, the lazy explanation is to call it “inconsistent.” Sometimes that is fair. But in many operational settings, inconsistency is the symptom, not the mechanism.

Consider a corporate assistant asked to summarize unread emails while hiding sender names. A few turns later, the same user asks who sent one of those emails. There is no exotic philosophical puzzle here. The model faces two live instructions: protect privacy and answer the user’s immediate request. If it reveals the sender, it may look helpful but violates the earlier constraint. If it refuses, it may look stubborn but preserves the privacy rule.

Now scale that pattern. Replace email summaries with procurement approval, medical triage support, financial reporting, customer refunds, fraud review, or HR case handling. The conflict is no longer a cute prompt example. It becomes an operational arbitration problem.

The paper formalizes this intuition through a priority graph. If a model must choose between two competing instructions or values, its output distribution reveals a priority relation. In simplified form, the question is: given competing options $A_1$ and $A_2$ under context $C$, which option does the model favor?

$$ p_\theta(D \mid A_1, A_2, C) $$

Here, $D$ is the model’s decision, meaning which of the two options its output ends up following, and $C$ matters. A priority relation is not universal; it is conditioned on the user, conversation history, task domain, retrieved information, timing, and broader environment. A creative-writing assistant may reasonably prioritize imagination over factual strictness. A regulatory-reporting assistant should do the opposite. Same model family, different operational context, different priority edge.

This is why a static rule such as “system instructions always beat user instructions” is necessary but incomplete. It can define part of the graph, especially for security-critical boundaries. But it cannot settle every conflict among truthfulness, helpfulness, privacy, fairness, loyalty to the user, legal compliance, organizational policy, and subjective preference. A company can declare a hierarchy. The model still has to apply it in context, often with incomplete facts.

The paper’s contribution is useful because it shifts the question from “How do we make the model aligned?” to “Which priority relation is the model applying right now, and is that relation authorized for this context?”

That second question is less elegant. It is also the one enterprise systems actually need.
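One rough way to make the first half of that question concrete is to sample the model's decision on the same conflict many times and see which option wins. The sketch below is illustrative only: `estimate_edge`, the `margin` threshold, the context names, and the sampled decisions are all assumptions made for this example, not anything the paper specifies.

```python
from collections import Counter

def estimate_edge(decisions, a, b, margin=0.2):
    """Return (winner, loser) if one option clearly dominates, else None.

    `decisions` is a list of option names the model followed across repeated
    samples of the same conflict. `margin` is a hypothetical threshold.
    """
    counts = Counter(decisions)
    total = counts[a] + counts[b]
    if total == 0:
        return None  # no evidence either way
    p_a = counts[a] / total  # empirical estimate of p(D = a | a, b, C)
    if p_a >= 0.5 + margin:
        return (a, b)
    if p_a <= 0.5 - margin:
        return (b, a)
    return None  # no stable priority in this context

# Hypothetical samples: same pair of values, three different contexts C.
observed = {
    "fiction_writing": ["imagination"] * 9 + ["factuality"] * 1,
    "regulatory_reporting": ["imagination"] * 1 + ["factuality"] * 9,
    "marketing_copy": ["imagination"] * 5 + ["factuality"] * 5,
}

edges = {
    context: estimate_edge(samples, "imagination", "factuality")
    for context, samples in observed.items()
}

for context, edge in edges.items():
    print(context, "->", edge)
```

An edge of `None` is informative in its own right: it marks a context where the model has no stable priority, which is where an explicit organizational rule is most likely to be needed.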

The five conflict types are a failure map for deployed agents

The paper organizes LLM alignment conflicts into five categories: instruction conflicts, information conflicts, ethics dilemmas, value dilemmas, and preference dilemmas. The taxonomy matters because these categories do not fail in the same way. Treating them all as “model safety issues” is like treating a broken keycard reader, a fire alarm, and a board dispute as the same kind of office problem. Technically, yes, people are upset. Operationally, the fixes differ.

| Conflict type | What collides | Typical enterprise version | What the system must decide |
| --- | --- | --- | --- |
| Instruction conflict | Explicit commands against other commands | A user asks the assistant to ignore a previous confidentiality constraint | Which instruction persists, and which authority level controls it |
| Information conflict | Internal model knowledge against external or retrieved data | RAG retrieves a document that contradicts prior knowledge or policy | Which source is trusted, current, authorized, and relevant |
| Ethics dilemma | Competing moral frameworks | A triage or risk-support tool must choose between harm minimization and procedural fairness | Whether the model should decide, defer, or present options |
| Value dilemma | Two positive values point in different directions | Truthfulness conflicts with protecting a vulnerable stakeholder | Which organizational value has priority in this domain |
| Preference dilemma | Subjective human preferences differ | An LLM judge rates creative work, UX copy, resumes, or support quality | Whose preference standard the evaluation represents |

The first two categories often have technical or procedural remedies. Instruction conflicts can be governed with authority hierarchy, memory rules, policy scopes, and refusal behavior. Information conflicts can be managed with source ranking, freshness checks, provenance, cross-validation, and retrieval security.
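For the information-conflict side, source ranking and freshness checks can be sketched in a few lines. Everything here is a hypothetical illustration: the `Source` fields, the `trust_tier` scale, the one-year staleness cutoff, and the example sources are assumptions, and a real deployment would pull them from its own provenance metadata.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Source:
    name: str
    trust_tier: int   # hypothetical scale: higher means more trusted
    updated: date     # last time the content was reviewed
    authorized: bool  # whether this source may inform the current task

def pick_source(sources, today, max_age_days=365):
    """Return the best eligible source, or None if nothing qualifies."""
    eligible = [
        s for s in sources
        if s.authorized and (today - s.updated).days <= max_age_days
    ]
    if not eligible:
        return None  # nothing trustworthy enough: defer instead of guessing
    # Prefer the higher trust tier; break ties with the fresher document.
    return max(eligible, key=lambda s: (s.trust_tier, s.updated))

today = date(2026, 1, 15)  # fixed date so the example is reproducible
candidates = [
    Source("internal_policy_store", 3, date(2025, 11, 1), True),
    Source("team_wiki_page", 2, date(2026, 1, 10), True),
    Source("forwarded_email_attachment", 1, date(2026, 1, 14), False),
    Source("archived_handbook", 3, date(2022, 3, 1), True),
]

best = pick_source(candidates, today)
print(best.name if best else "no eligible source")
```

The design choice worth noticing is the filter before the ranking: an unauthorized or stale source is dropped outright and never gets to compete on trust.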

The last three categories are harder because they often lack a factual ground truth. If an AI assistant must choose between transparency and emotional protection, or between sustainability and short-term economic value, a database lookup will not magically reveal the answer. It may reveal facts relevant to the decision. It will not decide the value trade-off unless the organization has already defined how that trade-off should be made.

This is the first business-relevant correction to the common misconception. Alignment conflicts are not all bugs. Some are unresolved governance decisions appearing at inference time, wearing the cheap costume of a prompt problem.

The priority graph changes with context, which is exactly the problem

A priority graph would be easy to manage if it behaved like a company org chart. System instruction at the top, developer instruction below it, user instruction below that, external documents somewhere in a tidy folder labeled “trust but verify.” Real deployment is less tidy.

The paper emphasizes that the graph is dynamic. The same pair of values can flip priority depending on the context. Imagination may outrank factuality in fiction writing. Factuality may outrank imagination in legal drafting. User preference may matter strongly in aesthetic recommendations. It should matter far less when the user asks for a procedure that violates security policy.

The graph can also be inconsistent. In abstract terms, a model may reveal priorities like $A \succ B$, $B \succ C$, and $C \succ A$ across related contexts. This does not require the model to be “irrational” in a dramatic science-fiction sense. It only requires a learned system trying to satisfy many overlapping patterns from training data, human feedback, safety tuning, tool outputs, and prompt context.

The paper does not present a benchmark proving a universal cycle in all deployed systems. It makes a conceptual point: once priorities are context-conditioned and plural, a single stable ordering is not guaranteed. That is enough to matter. Enterprise systems do not need a philosophical impossibility theorem before they start failing in production. They only need one ambiguous edge case routed automatically to the wrong decision path.

This also explains why prompt-only fixes feel strong in demos and fragile in workflows. A prompt can impose a local priority ordering. It can say: always follow policy; never reveal private information; verify facts before answering. But as the task grows longer, retrieves documents, calls tools, handles multiple users, and accumulates memory, the number of possible conflicts expands. The graph becomes an operating condition, not a footnote.
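If observed priority edges are collected into a directed graph, the cycle described above ($A \succ B$, $B \succ C$, $C \succ A$) can be detected with an ordinary depth-first search. This is a minimal sketch, assuming edges are stored as hypothetical `(winner, loser)` pairs; it is not a procedure taken from the paper.

```python
def find_cycle(edges):
    """Return one cycle as a list of nodes (first node repeated at the end),
    or None if the priority edges admit a consistent ordering."""
    graph = {}
    for winner, loser in edges:
        graph.setdefault(winner, []).append(loser)
        graph.setdefault(loser, [])

    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNSEEN for node in graph}
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == IN_PROGRESS:
                # Reached a node already on the current path: that is a cycle.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == UNSEEN:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state[node] == UNSEEN:
            found = visit(node)
            if found:
                return found
    return None

cyclic = [("A", "B"), ("B", "C"), ("C", "A")]
consistent = [("A", "B"), ("B", "C"), ("A", "C")]

print(find_cycle(cyclic))
print(find_cycle(consistent))
```

A detected cycle does not say which edge is wrong. It only says that no single ranking can honor all of them, which is the point at which a human-defined policy has to break the tie.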

Priority hacking exploits the model’s virtues, not only its weaknesses

The most useful security concept in the paper is priority hacking. It is different from the cartoon version of jailbreaking where the attacker simply says “ignore previous instructions” and hopes the model has the backbone of wet tissue.

Priority hacking is more subtle. The attacker frames a harmful request as serving a higher-level value the model has learned to respect. The paper’s example uses a justice-oriented framing: an alleged investigative journalist wants help crafting a persuasive phishing email to expose corporate wrongdoing. The request is harmful, but it is wrapped in a moral narrative: public health, justice, accountability.

That is the nasty part. The attack does not merely ask the model to abandon alignment. It asks the model to honor one part of alignment so strongly that another part breaks.

| Attack layer | What the attacker supplies | What the model may infer | Failure mode |
| --- | --- | --- | --- |
| High-level value | Justice, public safety, protection, education | This goal deserves support | The request gains moral weight |
| Operational action | Phishing, manipulation, policy bypass, unauthorized access | The action is framed as instrumental | Harm is laundered through virtue |
| Contextual premise | A fabricated scandal, emergency, or authority claim | The situation appears exceptional | Default safety priority is weakened |
| Model response | Helpfulness under moral pressure | Compliance feels aligned | The safety rule is bypassed |

This has immediate relevance for business AI agents. Attackers do not need to defeat every guardrail. They can create a context where the model’s own learned priorities do the work. A customer can frame a refund-policy bypass as fairness. An employee can frame data exfiltration as urgent compliance. A malicious document can frame an instruction override as required by the CEO. The costume changes; the mechanism stays boringly consistent.

The model is not necessarily “confused” in the ordinary sense. It is resolving a priority conflict under a manipulated context. That distinction matters because it changes the defense. More scolding in the system prompt may help at the margin, but it does not remove the need to verify the premise that created the priority shift.

Runtime verification is useful when the conflict depends on false premises

The paper proposes runtime verification as a partial defense. If a user’s request depends on a factual premise—“this company is secretly poisoning a town,” “this account is authorized,” “this document is an official policy,” “this is an emergency exception”—the agent should be able to verify that premise against trusted external sources or authorization systems before acting.

This is the most operationally valuable part of the paper. It moves alignment from text-level compliance to system-level assurance. The model should not merely ask, “Can I answer this?” It should ask, “Is the context that would make this answer permissible actually true?”

A useful enterprise pattern looks like this:

| Stage | Model question | System requirement | Example control |
| --- | --- | --- | --- |
| Detect conflict | Which policy, value, instruction, or source is competing? | Conflict classification | Privacy vs user request; safety vs claimed public interest |
| Check authority | Who is allowed to override what? | Identity and role validation | User role, department, approval chain |
| Verify premise | Is the contextual justification true? | Trusted source lookup | Legal database, internal policy store, ticketing system, CRM, security logs |
| Decide response mode | Can the agent act, refuse, defer, or ask for clarification? | Runtime policy | Safe refusal, limited answer, human escalation |
| Log priority edge | Which priority relation drove the decision? | Audit trail | “Privacy constraint outranked immediate user request” |

This is where enterprise AI design becomes less glamorous and more useful. A runtime verification layer may query internal policy, check user permissions, confirm document provenance, validate source freshness, or ask a human supervisor to approve an exceptional action. None of this requires pretending the model has solved moral philosophy. It requires not letting a persuasive paragraph rewrite the operating environment.
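The five stages in the table above can be sketched as a single function. This is a toy under stated assumptions: `ROLE_CAN_OVERRIDE` and `VERIFIED_PREMISES` are hypothetical in-memory stand-ins for a real identity provider and trusted record system, the ticket name is invented, and the escalate-when-unverified rule is one possible policy, not the paper's.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for real systems (identity provider, ticketing system).
ROLE_CAN_OVERRIDE = {"compliance_officer": {"privacy_constraint"}}
VERIFIED_PREMISES = {"ticket INC-1 is an approved emergency exception"}

@dataclass
class Decision:
    response_mode: str  # "act", "escalate", or "refuse"
    priority_edge: str  # which relation drove the outcome
    audit: list = field(default_factory=list)

def handle_conflict(constraint, request, role, premise):
    # 1. Detect conflict: record what is competing with what.
    audit = [f"conflict: {constraint} vs {request}"]
    # 2. Check authority: may this role override this constraint at all?
    allowed = constraint in ROLE_CAN_OVERRIDE.get(role, set())
    audit.append(f"role={role} may_override={allowed}")
    # 3. Verify premise: is the claimed justification on record as true?
    verified = premise in VERIFIED_PREMISES
    audit.append(f"premise_verified={verified}")
    # 4. Decide response mode.
    if allowed and verified:
        mode, edge = "act", f"{request} outranked {constraint}"
    elif allowed:
        mode, edge = "escalate", f"{constraint} held pending verification"
    else:
        mode, edge = "refuse", f"{constraint} outranked {request}"
    # 5. Log the priority edge for later review.
    audit.append(f"priority_edge: {edge}")
    return Decision(mode, edge, audit)

result = handle_conflict(
    "privacy_constraint", "reveal_sender", "employee", "this is urgent, trust me"
)
print(result.response_mode)
print(result.audit)
```

Note that the persuasiveness of the premise never enters the function. Only membership in a trusted record does, which is the whole defense against a persuasive paragraph rewriting the operating environment.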

There is also an important boundary. Runtime verification helps most when the conflict depends on factual uncertainty, deception, or authorization. It can verify whether a document exists, whether a user has permission, whether a news claim is supported, whether a transaction is within policy, or whether an emergency code is valid.

It cannot verify whether truthfulness should always outrank emotional protection. It cannot prove whether sustainability should outrank short-term profit in every business context. It cannot resolve whether a poem is better because it is ambiguous rather than direct. For those cases, verification supplies facts; governance supplies priorities.

Confusing these two is how companies end up with elaborate RAG systems that can retrieve the employee handbook very quickly and still make terrible decisions.

The paper’s figures and examples are conceptual scaffolding, not benchmark evidence

One easy mistake is to read the paper as if it were presenting a new empirical benchmark. It is not. Its contribution is primarily conceptual: taxonomy, formal framing, attack interpretation, and a mitigation direction.

That does not make it weak. It means the evidence should be interpreted correctly.

| Paper element | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Five-conflict taxonomy | Main conceptual framework | Alignment failures can be grouped by the type of collision | That these five categories are exhaustive |
| Priority graph formalization | Mechanism explanation | Model choices can be interpreted as context-dependent priority relations | That every priority edge can be measured cleanly in production |
| Instruction and information examples | Concrete illustrations | Ordinary workflows already contain conflicts | Frequency or severity across all deployments |
| Priority hacking example | Attack mechanism | Adversarial context can manipulate value priority | A quantified jailbreak success rate |
| Runtime verification proposal | Mitigation direction | Grounding can reduce deception-based priority shifts | A complete solution for ethics or value dilemmas |
| Philosophical intractability section | Boundary setting | Some conflicts lack universal ground truth | That systems should give up on governance |

This distinction is important for readers who want a neat chart of model performance before deciding whether the idea matters. The paper is not saying, “Model X fails 37% more often under condition Y.” It is saying, “Here is a structure that explains why many different failures have a common shape.”

For business readers, that may be more useful than another leaderboard. Leaderboards tell you which model wins a controlled contest. Priority conflicts tell you why your workflow fails after the model leaves the contest and meets a customer, a policy, a stale document, and an ambitious employee with admin rights.

The business value is cheaper diagnosis, not magical alignment

The immediate business use of the paper is diagnostic. It gives teams a better vocabulary for failure analysis.

When an AI agent fails, the postmortem should not stop at “the model hallucinated” or “the prompt was weak.” Those are often labels for ignorance. A priority-graph view asks more specific questions:

Was the model choosing between two explicit instructions?
Did retrieved information conflict with internal knowledge or policy?
Did a user-provided context create an exceptional moral justification?
Was the agent asked to resolve a value trade-off the business never defined?
Was the evaluation based on subjective preference without specifying whose preference mattered?

Those questions reduce debugging cost. They also reveal when the fix belongs outside the model.

| Failure diagnosis | Likely fix | Owner |
| --- | --- | --- |
| Conflicting user and policy instructions | Authority hierarchy and refusal templates | Product + policy |
| Untrusted retrieved content overrides behavior | Retrieval sandboxing and source provenance | Engineering + security |
| False emergency or justice framing weakens safety | Runtime premise verification | Engineering + risk |
| Value trade-off lacks organizational rule | Decision policy and escalation path | Leadership + compliance |
| Subjective evaluation varies by audience | Explicit preference profile or segmented evaluator | Product + domain experts |

This is where the paper quietly annoys both AI optimists and AI skeptics. To optimists, it says: better models alone will not erase conflict. To skeptics, it says: not every failure is mystical unreliability. Some failures are legible if you model the priority relation being applied.

A practical enterprise agent should therefore expose its conflict state. Not in a verbose chain-of-thought dump—nobody needs the model’s diary—but in structured metadata: conflict type, source authority, verified premises, selected policy, escalation trigger, and final response mode. That is the difference between a chatbot and an accountable workflow component.
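The structured metadata listed above maps naturally onto a small record type. The sketch below is one possible shape, with hypothetical field values taken from the email-sender example earlier in this article; the class name and the string vocabularies are assumptions, not a schema the paper defines.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional

@dataclass
class ConflictState:
    conflict_type: str                 # instruction, information, ethics, value, preference
    source_authority: str              # who issued the controlling instruction
    verified_premises: List[str]       # premises confirmed against trusted sources
    selected_policy: str               # the rule that was applied
    escalation_trigger: Optional[str]  # why a human was pulled in, if at all
    response_mode: str                 # act, refuse, defer, or clarify

state = ConflictState(
    conflict_type="instruction",
    source_authority="user",
    verified_premises=[],
    selected_policy="privacy_constraint_persists",
    escalation_trigger=None,
    response_mode="refuse",
)

# Emit as JSON so it can travel with the response into logs or dashboards.
payload = json.dumps(asdict(state), sort_keys=True)
print(payload)
```

Keeping the record flat and serializable is deliberate: it can be attached to every response and queried later without asking the model to explain itself in prose.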

Some dilemmas need governance before automation

The paper’s final boundary is the most important one: some conflicts are philosophically irreducible. That phrase can sound grand, but the business translation is straightforward. Some questions do not have a universal answer because stakeholders disagree about what should be optimized.

Should a healthcare assistant prioritize patient reassurance or strict disclosure? Should a hiring assistant optimize for predicted performance, diversity goals, procedural fairness, or explainability? Should a financial assistant prioritize user autonomy or loss prevention? Should a customer-support agent bend policy for loyalty, or enforce consistency across customers?

These are not questions a model should quietly settle because the prompt sounded confident.

Companies often try to outsource such trade-offs to AI because AI makes the decision look technical. The model returns an answer, the interface looks polished, and everyone enjoys the brief fantasy that governance has happened. It has not. The value choice has merely been pushed into an opaque statistical system and given a friendly typing animation.

For these cases, the right design is not a blanket “verify and proceed.” The response should depend on which of four conditions actually holds:

| Conflict condition | Better response pattern |
| --- | --- |
| The facts are uncertain but verifiable | Verify before acting |
| The authority is unclear | Authenticate, check role, or escalate |
| The value trade-off is defined by policy | Apply the policy and log the decision |
| The value trade-off is undefined or contested | Present options, defer, or require human judgment |

This is the discipline missing from many agent roadmaps. They specify tools, memory, workflow triggers, and integrations. They do not specify what the agent should do when two legitimate goals conflict. Then the first production incident arrives, and suddenly “AI strategy” becomes a meeting with Legal.

A priority-graph view does not remove that meeting. It helps schedule it before the incident.
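The four response patterns in the table above can be written down as an explicit routing function, which is one way to force the conversation before the incident. The function name, the three boolean flags, and the order in which they are checked are assumptions made for this sketch; an organization would need to choose its own ordering.

```python
def response_pattern(facts_verified, authority_clear, policy_defined):
    """Map the state of a conflict to one of the four response patterns.

    The check order (authority, then facts, then policy) is an assumption.
    """
    if not authority_clear:
        return "authenticate, check role, or escalate"
    if not facts_verified:
        return "verify before acting"
    if policy_defined:
        return "apply the policy and log the decision"
    # Facts and authority are settled, but nobody has decided the trade-off.
    return "present options, defer, or require human judgment"

# Hypothetical case: a known user, confirmed facts, no rule for the trade-off.
print(response_pattern(facts_verified=True, authority_clear=True, policy_defined=False))
```

The last branch is the important one. It is reached only when nothing is left to look up, which is exactly the case where verification has run out and governance has to take over.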

What Cognaptus would infer for enterprise AI design

The paper directly proposes a taxonomy, a priority-graph framing, the concept of priority hacking, and runtime verification as a grounding mechanism. From that, we can infer a practical enterprise architecture, with the usual caveat that inference is not the same as experimental validation.

A safer business agent should include at least four layers around the model:

Conflict detection. The system should classify whether the model is facing instruction, information, ethics, value, or preference conflict. This can be implemented through policy checks, retrieval metadata, user-role comparison, and model-based classifiers.
Priority policy. The organization should define which instructions, values, and sources outrank others in specific domains. “Be helpful” should not be allowed to freestyle against privacy, security, or regulated advice rules.
Runtime verification. When a request depends on a factual premise or authorization claim, the agent should check trusted sources before taking action. This is especially important for high-impact actions: sending emails, changing records, approving transactions, summarizing sensitive documents, or executing trades.
Escalation and audit. When the conflict is unresolved, the system should escalate or present alternatives. It should also log the priority basis for later review. The goal is not to expose private reasoning; the goal is to make operational decisions inspectable.

The ROI is not only fewer spectacular failures, although those are charmingly expensive. The ROI is also faster diagnosis, clearer accountability, reusable policy infrastructure, and reduced dependence on prompt folklore. Every serious organization eventually learns that “we told the model not to do that” is not an incident report.

The real alignment problem is operating under disagreement

The paper does not give us a final solution to LLM alignment. That is partly because it is a research paper, and partly because final solutions to plural human values are usually sold by people who should not be left alone with procurement authority.

What it gives us is a better operating model. LLM alignment is not only a training objective. It is a runtime priority-management problem under changing context.

That framing explains why conflicts appear across ordinary instructions, external information, ethical scenarios, value trade-offs, and subjective preferences. It explains why jailbreaks can work by manipulating context rather than merely attacking syntax. It explains why verification helps when premises are false, but not when the conflict is genuinely normative. And it explains why enterprise AI needs governance architecture, not just larger models and longer prompts.

The old alignment question was: how do we make the model agree with us?

The better question is: when the system cannot agree with all of its instructions, values, users, and sources at once, who decides what wins?

Until businesses answer that question explicitly, their AI agents will answer it implicitly.

And implicit governance is still governance. It is just governance with worse documentation.

Cognaptus: Automate the Present, Incubate the Future.

¹ Zhenheng Tang, Xiang Liu, Qian Wang, Eunsol Choi, Bo Li, and Xiaowen Chu, “Are Dilemmas and Conflicts in LLM Alignment Solvable? A View from Priority Graph,” arXiv:2603.15527, 2026, https://arxiv.org/abs/2603.15527.
