Alignment Isn’t Free: When Safety Objectives Start Competing

Customer support is where alignment theories go to become invoices.

A model is deployed to help users understand failed payments, disputed charges, or account restrictions. Product wants it to be useful. Legal wants it to avoid regulated advice. Trust and safety wants it to refuse suspicious requests. Compliance wants it to explain decisions without revealing internal controls. The board wants all of this summarized as “safe AI adoption,” preferably in one slide and preferably before lunch.

Then the model starts refusing harmless questions, giving vague explanations, or answering confidently in places where it should slow down. Everyone looks surprised. They should not.

The paper at the center of this article, Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment, makes the uncomfortable point directly: alignment is not a single objective wearing a prettier jacket. It is a multi-objective optimization problem, and improving one objective—such as harmlessness—can reduce another—such as helpfulness.¹ That cost is not a bug in the dashboard. It is the dashboard.

The useful reading is not “alignment is impossible.” That would be dramatic, therefore convenient, therefore probably wrong. The useful reading is narrower: alignment work should move from accumulating safety objectives to managing trade-offs among them.

The old assumption was additive safety; the paper treats alignment as a budget

A common operational myth says that safety layers stack neatly.

First, the model is trained to be helpful. Then it is trained to be harmless. Then it is nudged to be honest. Then refusal policies are added. Then compliance filters arrive, because apparently the previous four layers were not anxious enough.

The mental model is additive: each layer contributes another unit of alignment. More rules, more safety. More refusal data, more protection. More preference labels, more human values. It is a comforting story, and like most comforting stories in AI governance, it becomes expensive when believed too literally.

The CPO paper starts from a different premise. Human preferences are multi-dimensional. The familiar “3H” alignment goals—helpfulness, honesty, and harmlessness—are not always parallel. Sometimes they reinforce each other. Sometimes they collide. A model that is maximally helpful may answer dangerous requests. A model that is maximally harmless may refuse benign ones. A model that is maximally compliant may become tactically vague when honesty is costly.

That is the key misconception: alignment objectives are not automatically cumulative. They form a frontier. Moving toward one region of that frontier changes what the system can do elsewhere.

In business terms, this means alignment is not merely a technical tuning exercise. It is a resource allocation decision. A healthcare assistant, a coding copilot, a financial customer-service bot, and an internal legal-search tool should not share the same alignment balance simply because all four were blessed by the same policy document.

That would be governance by copy-paste. It is common. It is also how organizations manufacture confusion at scale.

What the paper actually does: make preference trade-offs explicit

The paper proposes Controllable Preference Optimization (CPO), a method designed to make multi-objective alignment controllable rather than implicit.¹

The technical idea is straightforward enough to be useful. Instead of training the model to maximize all desirable values at once, CPO introduces explicit preference conditions. These conditions tell the model which objective level is desired in a given context. In the paper’s implementation, these conditions can appear as preference tokens such as <Helpfulness: 5> or <Harmlessness: 1>.
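To make the idea of a preference condition concrete, here is a minimal Python sketch of how such tokens might be attached to a prompt. It is an illustration, not the paper's code: the function name, the 1 to 5 scale, the fixed token order, and the choice to prefix the prompt are all assumptions; only the token shape `<Helpfulness: 5>` comes from the text above.

```python
from typing import Dict

# Assumed objective names (the 3H goals) and an assumed 1-5 scale.
ALLOWED_OBJECTIVES = ("Helpfulness", "Honesty", "Harmlessness")


def build_conditioned_prompt(user_prompt: str, levels: Dict[str, int]) -> str:
    """Prefix a prompt with preference tokens such as <Helpfulness: 5>."""
    unknown = set(levels) - set(ALLOWED_OBJECTIVES)
    if unknown:
        raise ValueError(f"unknown objectives: {sorted(unknown)}")

    tokens = []
    for name in ALLOWED_OBJECTIVES:  # fixed order keeps prompts reproducible
        if name not in levels:
            continue  # objectives without a stated level get no token
        level = levels[name]
        if not 1 <= level <= 5:
            raise ValueError(f"{name} level must be in 1..5, got {level}")
        tokens.append(f"<{name}: {level}>")
    return " ".join(tokens + [user_prompt])


if __name__ == "__main__":
    print(build_conditioned_prompt(
        "Why did my payment fail?",
        {"Helpfulness": 5, "Harmlessness": 1},
    ))
```

The point of the sketch is only that the desired balance becomes an explicit input rather than something buried in training data.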

The method has two training stages, paired with an evaluation that keeps the objectives separate:

Component	What it does	Why it matters
Controllable Preference Supervised Fine-Tuning	Teaches the model to respond under explicit preference conditions	Gives the model a controllable behavioral axis rather than one blended “goodness” target
Controllable Direct Preference Optimization	Extends preference optimization so comparisons are evaluated under stated value conditions	Reduces the pressure to collapse all preferences into one scalar reward
Multi-objective evaluation	Measures helpfulness, honesty, and harmlessness separately	Makes trade-offs visible instead of hiding them in one aggregate score

This matters because a single scalar reward is a poor container for values that conflict. A score that says “good response: 8.2” may hide the fact that the answer is useful but risky, safe but useless, honest but unhelpful, or compliant in the way a tax form is compliant: technically correct and emotionally dead.
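A small Python sketch shows how this can happen. The two score profiles below are invented for illustration (they are not from the paper), and the plain average stands in for whatever blending a real reward model performs.

```python
from statistics import mean

# Made-up scores on a 0-10 scale, chosen so both profiles average to 8.2.
responses = {
    "useful_but_risky": {"helpfulness": 9.5, "honesty": 9.0, "harmlessness": 6.1},
    "safe_but_useless": {"helpfulness": 6.1, "honesty": 9.0, "harmlessness": 9.5},
}


def scalar_score(scores: dict) -> float:
    """Collapse all objectives into one number (here: a plain average)."""
    return round(mean(scores.values()), 1)


def weakest_objective(scores: dict) -> str:
    """Return the objective with the lowest score."""
    return min(scores, key=scores.get)


if __name__ == "__main__":
    for name, scores in responses.items():
        print(name, scalar_score(scores), "weakest:", weakest_objective(scores))
```

Both responses get the same blended score, yet they fail in opposite directions. Only the per-objective view reveals which one is the risky answer and which one is the useless one.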

CPO does not eliminate trade-offs. It gives the training and inference process a way to represent them.

That distinction is important. A paper that claims to abolish trade-offs should be read with gloves. This one is more interesting because it treats trade-offs as the object of control.

The strongest evidence is not that CPO wins; it is where the conflicts appear

The paper’s empirical work uses the 3H frame: helpfulness, honesty, and harmlessness. The authors evaluate models using MT-Bench-style helpfulness scoring, HaluEval-style honesty evaluation, and jailbreak-oriented harmlessness tests. These are not perfect proxies for production behavior, but they are suitable for detecting whether objectives move together or against each other.

The headline result is that CPO improves over comparable SFT and DPO settings across the reported 3H metrics. In the paper’s table, Mistral-7B-CPO reports higher average helpfulness, honesty, and harmlessness than the DPO model trained on the same alignment data. The relevant comparison is not the absolute number; it is the pattern:

Model setting	Helpfulness average	Honesty average	Harmlessness average	Practical reading
Mistral-7B-SFT	6.53	8.20	2.58	Useful and fairly honest, but weak on safety
Mistral-7B-DPO	5.99	8.02	5.14	Safety improves, but usefulness drops
Mistral-7B-CPSFT	6.43	8.50	5.18	Controllability helps, but not enough by itself
Mistral-7B-CPO	7.11	8.66	7.30	Best reported balance under this setup
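The pattern is easier to see as differences between rows. The following Python sketch uses only the averages quoted in the table above; the dictionary layout and the `delta` helper are illustrative names, not anything from the paper.

```python
OBJECTIVES = ("helpfulness", "honesty", "harmlessness")

# Averages as quoted in the table above: (helpfulness, honesty, harmlessness).
REPORTED = {
    "Mistral-7B-SFT": (6.53, 8.20, 2.58),
    "Mistral-7B-DPO": (5.99, 8.02, 5.14),
    "Mistral-7B-CPSFT": (6.43, 8.50, 5.18),
    "Mistral-7B-CPO": (7.11, 8.66, 7.30),
}


def delta(after: str, before: str) -> dict:
    """Per-objective change when moving from one setting to another."""
    return {
        name: round(a - b, 2)
        for name, a, b in zip(OBJECTIVES, REPORTED[after], REPORTED[before])
    }


if __name__ == "__main__":
    print("SFT -> DPO:", delta("Mistral-7B-DPO", "Mistral-7B-SFT"))
    print("DPO -> CPO:", delta("Mistral-7B-CPO", "Mistral-7B-DPO"))
```

Moving from SFT to DPO gains 2.56 points of harmlessness while losing 0.54 of helpfulness and 0.18 of honesty. Moving from DPO to CPO raises all three.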

The paper also reports a more revealing qualitative finding: harmlessness appears more often at odds with the other objectives, while helpfulness and honesty can sometimes complement each other. That is exactly the sort of result practitioners should care about. The business question is not “Which model has the highest average score?” It is “Which objective becomes expensive when we tighten another?”

This is where magnitude matters. A harmlessness improvement that costs a small amount of verbosity may be acceptable. A harmlessness improvement that causes the model to refuse routine customer-service questions is not. A helpfulness improvement that slightly increases answer length is one thing. A helpfulness improvement that weakens jailbreak resistance is another. The word “trade-off” is too vague unless the organization measures what is being traded.

This is also why the paper’s sensitivity analysis is useful. The authors show that more control does not monotonically improve everything. At first, controllability helps reduce conflict. Later, excessive control can damage generative capability. That is a very production-shaped result: tuning knobs are useful until somebody mistakes them for magic.

The mechanism: objectives pull on different behavioral directions

The simplest way to understand the conflict is to stop treating “alignment” as a moral property and start treating it as a behavioral geometry.

Helpfulness pushes the model toward answering. Harmlessness pushes it toward refusing, redirecting, or narrowing. Honesty pushes it toward uncertainty, qualification, or correction. Compliance may push it toward organizational policy, even when the user asks a reasonable question. These directions sometimes align, but not always.

A benign cybersecurity question illustrates the problem. A user asks how to secure a server after a suspicious login. Helpfulness says: give concrete steps. Harmlessness says: watch for dual-use content. Honesty says: ask for missing context. Compliance says: avoid instructions that might be interpreted as enabling intrusion. A weakly governed system may answer too much. An over-aligned system may refuse too quickly. A better system identifies the context, provides defensive guidance, and withholds offensive detail.

That behavior is not produced by “more safety” in the abstract. It is produced by a policy-sensitive balance among objectives.

Safe RLHF makes a related move by separating helpfulness and harmlessness into reward and cost models, treating safety as a constraint rather than another blended preference.² Panacea similarly frames alignment as Pareto adaptation across multi-dimensional preferences rather than a scalar preference label.³ Later work on hierarchical experts also treats multi-objective alignment as a problem of covering different regions of the Pareto frontier rather than forcing one model behavior to satisfy every preference setting equally.⁴
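The reward-and-cost separation described for Safe RLHF can be written as a generic constrained optimization problem. The notation below is chosen here for illustration and may differ from the cited paper: $R$ is a helpfulness reward, $C$ a harmfulness cost, $d$ a cost budget, and $\lambda$ a Lagrange multiplier.

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[ R(x, y) \big]
\quad \text{subject to} \quad
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[ C(x, y) \big] \le d

% Lagrangian relaxation of the same problem:
\min_{\lambda \ge 0}\; \max_{\theta}\;
\mathbb{E}\big[ R(x, y) \big] \;-\; \lambda \Big( \mathbb{E}\big[ C(x, y) \big] - d \Big)
```

In this form the trade-off has a price: $\lambda$ says how much helpfulness the optimizer will give up to reduce expected cost by one unit. A single blended reward hides that exchange rate.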

The shared lesson is not that one method has won. The lesson is that the field is slowly admitting that “aligned” is not a single checkbox. Progress, how tragic, requires accounting.

The appendix-style tests support robustness, not a second thesis

An earlier version of this article treated the paper almost like a warning about alignment mechanisms generally colliding under scale. That direction is fair, but it needs discipline. The CPO paper is not a grand theory of all alignment failure. It is a concrete method paper with experiments showing that explicit preference conditions can improve controllability and reduce observed trade-offs among helpfulness, honesty, and harmlessness.

The supporting tests should be read accordingly.

The Pareto evaluation asks whether models trained for one objective, models trained on mixed data, and controllable models occupy different regions of the objective space. The result supports the paper’s main claim: naive mixing is not enough, and controllable preference grounding can improve the frontier.
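For readers who want the frontier idea in operational form, here is a minimal Python sketch of Pareto dominance. It applies the definition to the four averages quoted in this article's results table. It is not a reproduction of the paper's Pareto evaluation, which compares models across preference settings; the function names are illustrative.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, float, float]  # (helpfulness, honesty, harmlessness)

# Averages as quoted in this article's results table.
REPORTED: Dict[str, Scores] = {
    "Mistral-7B-SFT": (6.53, 8.20, 2.58),
    "Mistral-7B-DPO": (5.99, 8.02, 5.14),
    "Mistral-7B-CPSFT": (6.43, 8.50, 5.18),
    "Mistral-7B-CPO": (7.11, 8.66, 7.30),
}


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: Dict[str, Scores]) -> List[str]:
    """Names of the points that no other point dominates."""
    return [
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    ]


if __name__ == "__main__":
    print(pareto_front(REPORTED))
```

On these four points, CPSFT dominates DPO but not SFT, because SFT keeps a slightly higher helpfulness average. CPO dominates all three, so it is the only point left on the front. Four averages are a thin sample, which is why the paper's frontier comparison matters more than this arithmetic.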

The sensitivity analysis asks how control strength affects behavior. This supports the mechanism: preference control can mitigate conflicts, but too much control can reduce generation quality. That is not a separate thesis about all safety systems. It is evidence about the cost of pushing the control mechanism itself.

The case studies show that preference tokens can steer output style and specificity. Useful, yes. Proof of enterprise safety, no. A case study is not a compliance audit wearing a lab coat.

Test type	What it supports	What it does not prove
3H evaluation	Objectives can conflict, and CPO improves the reported balance in this setup	That all alignment objectives can be reconciled
Pareto comparison	Mixed-data training is weaker than explicit controllability for frontier coverage	That preference tokens are sufficient for production governance
Sensitivity analysis	Control strength has an optimal range	That the same range transfers across domains
Case examples	The model responds differently under stated preference levels	That users or attackers cannot exploit those controls

This distinction matters for business readers because robustness evidence is often overread. A paper shows one type of generalization; a product team hears “deployment-ready.” Somewhere in that gap, a risk register quietly starts drinking.

Business implication: build alignment portfolios, not alignment slogans

For organizations deploying LLMs, the practical takeaway is not to copy CPO. Most firms should not be building their own alignment algorithms unless they enjoy surprise infrastructure costs and very long meetings.

The takeaway is to change the governance question.

Do not ask: “Is this model aligned?”

Ask: “Which objectives dominate in this workflow, and what degradation are we willing to accept in the others?”

A useful deployment review should separate at least four layers:

Layer	Question to ask	Example business decision
Objective priority	Which value dominates in this use case?	In medical triage, harmlessness and uncertainty handling dominate raw helpfulness
Failure cost	What happens when the objective balance is wrong?	In customer support, over-refusal creates churn; under-refusal creates legal exposure
Measurement	Which metrics are tracked separately?	Refusal rate, resolution rate, escalation rate, factual correction rate, unsafe-completion rate
Override policy	Who can change the balance, and under what evidence?	Compliance can tighten refusal thresholds, but product must quantify utility loss

This is where Cognaptus infers beyond the paper. The paper directly shows that explicit preference conditioning can improve multi-objective alignment in evaluated LLM settings. The business inference is that enterprise AI systems need alignment budgets: explicit, documented trade-off decisions across contexts.

An internal coding assistant may tolerate more helpfulness and less refusal if it runs in a sandbox. A public legal assistant should prioritize uncertainty and safe redirection. A sales assistant should not become so “helpful” that it invents discount authority. A financial support bot should not become so “safe” that it refuses to explain standard account procedures.

These are not personality settings. They are operational risk settings.
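An alignment budget of this kind can be written down as data rather than left in a slide. The Python sketch below is one hypothetical shape for it: the class, the field names, the metric names, and every threshold are invented for illustration and would need to come from an organization's own risk review.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AlignmentBudget:
    workflow: str
    dominant_objective: str       # which value wins when objectives conflict
    max_rates: Dict[str, float]   # metric name -> highest acceptable rate
    override_owner: str           # who may change the limits


def violations(budget: AlignmentBudget, measured: Dict[str, float]) -> List[str]:
    """List budgeted metrics that are missing or above their limit."""
    problems = []
    for metric, limit in budget.max_rates.items():
        if metric not in measured:
            problems.append(f"{metric}: not measured")
        elif measured[metric] > limit:
            problems.append(f"{metric}: {measured[metric]:.3f} exceeds {limit:.3f}")
    return problems


if __name__ == "__main__":
    # Hypothetical limits for a financial support bot.
    support_bot = AlignmentBudget(
        workflow="financial_support",
        dominant_objective="helpfulness",
        max_rates={"over_refusal_rate": 0.05, "unsafe_completion_rate": 0.01},
        override_owner="compliance",
    )
    print(violations(support_bot, {"over_refusal_rate": 0.12}))
```

The check treats a missing metric as a violation. That choice is deliberate in this sketch: a budget that nobody measures is a slogan with a schema.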

Over-refusal is the visible symptom of an invisible trade-off

Over-refusal is the easiest alignment conflict for non-technical users to notice. The model refuses a harmless request, and the user concludes the system is stupid. That reaction is crude but not entirely unfair.

Recent work on over-refusal argues that safety alignment can cause models to reject benign prompts unnecessarily, reducing utility even when the original goal—resisting harmful instructions—is legitimate.⁵ The interesting point is not that refusal is bad. Refusal is necessary. The problem is poorly calibrated refusal.

This connects directly to the CPO paper’s broader message. If harmlessness is pushed without enough contextual control, the model may learn a blunt association: suspicious topic equals refusal. That may look safe in a benchmark that rewards rejection. It looks less impressive when a legitimate employee asks for defensive security guidance and receives a corporate haiku about responsible AI.

For business teams, over-refusal should be monitored as a first-class metric. Not because user annoyance is the highest moral category, but because over-refusal often signals that the model is not distinguishing risk context well. It is treating surface cues as policy. In regulated workflows, that can be safer than under-refusal, but it is still a form of system failure.

A mature evaluation should therefore track both sides:

Risk	What it looks like	Why single safety scores miss it
Under-refusal	Model provides unsafe or policy-violating content	Helpfulness metrics may reward completion
Over-refusal	Model rejects benign or necessary tasks	Safety metrics may reward caution
Strategic vagueness	Model answers without useful substance	Both safety and completion metrics may look acceptable
Misplaced certainty	Model gives confident answers where uncertainty is required	Helpfulness and fluency can hide epistemic risk
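The first two rows of the table can be measured from the same labeled evaluation set. A minimal Python sketch, assuming each record carries a label for whether the prompt was benign and whether the model refused; the record type and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class EvalRecord:
    prompt_is_benign: bool  # assumed to come from human or policy labeling
    model_refused: bool


def refusal_rates(records: Iterable[EvalRecord]) -> Dict[str, Optional[float]]:
    """Over-refusal on benign prompts and under-refusal on harmful ones."""
    records = list(records)
    benign = [r for r in records if r.prompt_is_benign]
    harmful = [r for r in records if not r.prompt_is_benign]
    over = sum(r.model_refused for r in benign) / len(benign) if benign else None
    under = sum(not r.model_refused for r in harmful) / len(harmful) if harmful else None
    return {"over_refusal_rate": over, "under_refusal_rate": under}


if __name__ == "__main__":
    sample = [
        EvalRecord(prompt_is_benign=True, model_refused=False),
        EvalRecord(prompt_is_benign=True, model_refused=True),
        EvalRecord(prompt_is_benign=False, model_refused=True),
        EvalRecord(prompt_is_benign=False, model_refused=False),
    ]
    print(refusal_rates(sample))
```

Reporting the two rates side by side is the point. A single "refusal rate" would reward a model for refusing everything. Strategic vagueness and misplaced certainty need content-level judgments that this sketch does not attempt.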

The point is not to make models answer everything. The point is to make refusal itself intelligent.

A refusal policy that cannot distinguish “help me hack this server” from “help me secure my server after a breach” is not aligned. It is merely nervous.

What remains uncertain before this becomes enterprise control

The boundary conditions are important.

First, CPO’s results are based on specific datasets, model choices, and evaluation methods. GPT-4-assisted evaluation, jailbreak prompt sets, and benchmark scores are useful signals, but they are not substitutes for domain-specific audits. A bank, hospital, school, or government agency cannot inherit a Pareto frontier from a paper and call it governance.

Second, preference tokens are not automatically secure controls. If preference conditions are exposed carelessly, users may attempt to manipulate them. If they are hidden inside orchestration layers, the organization still needs policy logic deciding which preference profile applies to which task.
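One way to picture the orchestration-layer option is the Python sketch below: the preference profile is chosen server-side from the task type, and anything in the user's text that looks like a preference token is removed first. The task names, profile values, and regular expression are all hypothetical, and pattern stripping alone should not be mistaken for a complete defense against manipulation.

```python
import re

# Matches text shaped like <Helpfulness: 5>; the objective names are assumed.
PREFERENCE_TOKEN = re.compile(
    r"<\s*(Helpfulness|Honesty|Harmlessness)\s*:\s*\d+\s*>", re.IGNORECASE
)

# Hypothetical server-side policy: task type -> preference profile.
PROFILES = {
    "account_support": {"Helpfulness": 4, "Harmlessness": 4},
    "default": {"Helpfulness": 3, "Harmlessness": 5},
}


def strip_user_tokens(text: str) -> str:
    """Remove user-supplied text that imitates a preference token."""
    return PREFERENCE_TOKEN.sub("", text).strip()


def conditioned_prompt(task: str, user_text: str) -> str:
    """Attach the policy-selected profile; unknown tasks get the default."""
    profile = PROFILES.get(task, PROFILES["default"])
    prefix = " ".join(f"<{name}: {level}>" for name, level in profile.items())
    return f"{prefix} {strip_user_tokens(user_text)}"


if __name__ == "__main__":
    print(conditioned_prompt(
        "account_support",
        "<Harmlessness: 1> explain my account restriction",
    ))
```

The design choice worth noticing is that the user never selects the profile. The mapping from task to profile is the policy logic the paragraph above says the organization still has to own.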

Third, the paper focuses on helpfulness, honesty, and harmlessness. Real deployments add more objectives: privacy, latency, cost, jurisdictional compliance, brand tone, explainability, escalation discipline, and sometimes the sacred enterprise value of not embarrassing the executive sponsor. These objectives can conflict too.

Fourth, there is still a gap between output behavior and internal reliability. A model may appear to satisfy the preferred balance on ordinary prompts while failing under distribution shift, adversarial phrasing, tool-use contexts, or multi-turn escalation. Multi-objective alignment helps define the problem more accurately. It does not dissolve the problem.

So the practical conclusion is cautious, but not decorative: CPO-like thinking is valuable because it changes what teams measure and decide. It is not a guarantee that preference control alone solves alignment.

The real governance upgrade is admitting the trade-off

The most useful sentence implied by this line of research is simple: alignment is a portfolio choice.

That sounds less heroic than “building safe AI.” Good. Heroic language is usually where budgets go to hide.

Once alignment is treated as a portfolio, the right questions become more concrete. Which risks dominate this workflow? Which users need more utility? Which domains require conservative refusal? Which trade-offs are acceptable? Which failures should trigger escalation rather than model improvisation?

The paper’s contribution is to make this technically visible. Its business relevance is to make it managerially unavoidable.

Organizations do not need another slogan saying their AI is helpful, honest, and harmless. They need systems that show when those goals reinforce each other, when they compete, and who decided which one wins.

Alignment is not free. The bill arrives as refusal friction, lost utility, brittle behavior, evaluation complexity, and governance overhead. Paying that bill explicitly is annoying. Paying it implicitly is worse.

The model will still make the trade-off. The only question is whether the organization knows which trade-off it bought.

Cognaptus: Automate the Present, Incubate the Future.

1. Yiju Guo et al., “Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment,” arXiv:2402.19085, 2024. https://arxiv.org/abs/2402.19085
2. Josef Dai et al., “Safe RLHF: Safe Reinforcement Learning from Human Feedback,” arXiv:2310.12773, 2023. https://arxiv.org/abs/2310.12773
3. Yifan Zhong et al., “Panacea: Pareto Alignment via Preference Adaptation for LLMs,” arXiv:2402.02030, 2024. https://arxiv.org/abs/2402.02030
4. Zhuo Li et al., “Multi-objective Large Language Model Alignment with Hierarchical Experts,” arXiv:2505.20925, 2025. https://arxiv.org/abs/2505.20925
5. Mahavir Dabas et al., “Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning,” arXiv:2507.04250, 2025. https://arxiv.org/abs/2507.04250
