The Conscience Plug-in: Teaching AI Right from Wrong on Demand

TL;DR for operators

The paper’s central move is not “we trained a moral model.” It is “we inserted a referee between the agent’s plan and the agent’s action.” That distinction matters. If the architecture works, enterprises do not need to retrain every model whenever compliance, cultural norms, safety rules, or customer-specific constraints change. They can externalise those constraints into machine-readable constitutions and enforce them at runtime.

The proposed Superego Agent checks an inner agent’s intended action before execution. It first applies a non-negotiable Universal Ethical Floor, then applies user-selected Creed Constitutions with dialable adherence levels. A level-five rule can behave like a hard prohibition; lower levels can act more like preferences. The system can allow, block, modify, or request clarification.

The best empirical evidence is strong but narrower than the product vision. On HarmBench human jailbreaks, the Superego plus Universal Ethical Floor reduced attack success rates across GPT-3.5-Turbo, Gemini 2.5 Flash, and GPT-4o. On AgentHarm, Gemini 2.5 Flash’s harm score fell by 98.29%, and harmful-prompt refusals rose from 52.6% to 99.4%. Claude Sonnet 4 reached 100% refusal on the harmful set in one reported configuration, and after further tuning reached 96.6% harmful refusal while matching the baseline benign refusal rate of 2.27%.¹

That is useful. It is not magic. The benchmarked safety gains mainly test the Universal Ethical Floor, not the full promise of a community marketplace of personalised constitutions. The prototype also exposes real deployment problems: context overload, false positives, sensitive preference data, latency, third-party model guardrail conflicts, and MCP tool-poisoning risks. The conscience plug-in has teeth. It also has a maintenance bill. Naturally.

The ordinary failure case: the agent does the task, just not the right task

Imagine a purchasing agent. It searches the web, compares products, chooses a supplier, and prepares an order. The model is “helpful.” The workflow is automated. The interface looks polished. Then it buys the wrong item because the user has a severe allergy, a religious restriction, a corporate procurement rule, or a non-negotiable legal constraint that was buried somewhere in the context pile.

That is the alignment problem at the level where businesses actually feel it: not grand philosophy, but an autonomous system taking a plausible action that violates a rule everyone assumed it “should have known.”

The paper “Personalized Constitutionally-Aligned Agentic Superego” proposes a direct architectural answer.¹ Instead of relying only on the base model’s training, system prompt, or final output filter, it inserts a separate oversight module into the agentic workflow. The authors call it a Superego Agent. The name is a little theatrical, but the mechanism is practical: the inner agent proposes an action; the Superego checks the action against active rules; only then does execution proceed.

This is why the paper should be read mechanism-first. The headline benchmark numbers are impressive, but the deeper contribution is the control pattern. Alignment becomes less like a model personality trait and more like runtime infrastructure.

The mechanism is a referee, not a personality transplant

The paper contrasts two possible routes for building this kind of “conscience.” Path A would bake the oversight logic into the model itself through architecture or fine-tuning. Path B, which the paper implements, keeps the base model intact and adds an external guardrail process around it.

That choice is the whole story.

A model-level solution can be elegant, but it is expensive, slow to update, and tied to a specific model family. A system-level solution is messier but portable. It can sit beside different models, intercept plans, and apply rules that are changed without retraining. For business adoption, messy-but-portable often beats elegant-but-immovable. Procurement departments rarely enjoy waiting for a foundation model to be retrained because the travel policy changed. Astonishing, I know.

The architecture has several core parts:

Component	What it does	Operational interpretation
Inner agent	Plans and executes the user’s task	The productive worker
Superego Agent	Reviews planned actions against rules	The compliance and values referee
Creed Constitutions	Encoded value sets, policies, norms, or preferences	Modular policy packages
Dialable adherence levels	User-selected strictness, typically on a 1–5 scale	Difference between “nice to have” and “must not violate”
Universal Ethical Floor	Non-negotiable baseline safety layer	The rule that personalisation cannot excuse harm
Real-time Compliance Enforcer	Allows, blocks, modifies, or requests clarification before execution	The action gate
MCP integration	Retrieves constitutions at runtime through a protocol layer	The distribution pipe for policy context

The paper’s algorithm is deliberately hierarchical. First, the proposed action is checked against the Universal Ethical Floor. If it violates the floor, it is blocked. Only after that does the system evaluate the user’s selected constitutions and their adherence levels. If there is no violation, the action proceeds. If there is a critical violation, it is blocked. If there is ambiguity, the system can ask the user. If there is a softer conflict, it may generate a compliant alternative.

That is more subtle than a content filter. A filter says “yes” or “no.” A Superego-style controller can say “not that, but this version is acceptable,” or “I need a decision because two active rules conflict.” This is closer to how governance actually works: not as a single red button, but as a structured escalation path.
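The hierarchy described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the `Rule` and `Verdict` names, the true/false/unsure `violates` callback, the sample rules, and the choice to treat only level-five rules as blocking are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    MODIFY = "modify"
    CLARIFY = "clarify"


@dataclass
class Rule:
    name: str
    adherence: int  # 1 = soft preference ... 5 = hard constraint
    # Returns True (violates), False (complies), or None (cannot tell).
    violates: Callable[[str], Optional[bool]]


def review(action: str, floor: list[Rule], creeds: list[Rule]) -> tuple[Verdict, str]:
    """Gate a proposed action before execution."""
    # Step 1: the Universal Ethical Floor comes first and cannot be dialled down.
    for rule in floor:
        if rule.violates(action):
            return Verdict.BLOCK, f"floor: {rule.name}"

    # Step 2: user-selected constitutions, strictest first.
    for rule in sorted(creeds, key=lambda r: r.adherence, reverse=True):
        outcome = rule.violates(action)
        if outcome is None:
            return Verdict.CLARIFY, f"ambiguous: {rule.name}"
        if outcome and rule.adherence >= 5:
            return Verdict.BLOCK, f"critical: {rule.name}"
        if outcome:
            return Verdict.MODIFY, f"soft conflict: {rule.name}"

    return Verdict.ALLOW, "no violation"


# Hypothetical rules; real constitutions would be judged by a model, not substring checks.
floor = [Rule("no weapons purchases", 5, lambda a: "weapon" in a)]
creeds = [
    Rule("vegan preference", 2, lambda a: "cheese" in a),
    Rule("severe peanut allergy", 5, lambda a: "peanut" in a),
    Rule("approved suppliers only", 4,
         lambda a: None if "unknown supplier" in a else False),
]

for proposed in ["order peanut butter", "order cheese pizza",
                 "order oat milk from unknown supplier", "order oat milk"]:
    verdict, reason = review(proposed, floor, creeds)
    print(f"{proposed!r}: {verdict.value} ({reason})")
```

The point of the sketch is the ordering: the floor is evaluated before any dial is consulted, so no adherence setting can loosen it.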

Creed Constitutions turn preferences into operational objects

The paper’s personalisation layer is built around “Creed Constitutions”: modular rule sets that capture user, community, cultural, religious, educational, professional, or organisational constraints. Examples in the paper include vegan rules, Halachic dietary law, Hindu principles, K-12 educational appropriateness, severe allergy constraints, corporate policies, and sensitive-domain safety standards.

The clever detail is not merely that these constitutions exist. It is that users can set adherence levels. A vegan preference at level two might guide recommendations. A severe allergy at level five should block risky purchases. A corporate fiduciary duty is not a vibe. It is not “try to remember this, little chatbot.” It is a hard constraint.

This distinction is commercially important. Most enterprise AI governance today lives in one of three awkward places, shown in the first three rows below; the fourth row is the Superego-style alternative:

Governance location	Strength	Weakness
Model training or fine-tuning	Deep behavioural shaping	Slow to change, costly, model-specific
System prompts	Easy to update	Brittle, overloaded, easily diluted
Output filters	Simple enforcement	Late-stage, often binary, poor at plan-level risk
Superego-style runtime layer	Modular, auditable, pre-execution	Adds complexity, latency, and another attack surface

The Superego Agent is an attempt to occupy the fourth position. It is not a replacement for the other three. It is a control layer for agentic systems where actions happen before a human has time to read a pretty refusal message.

MCP makes the idea deployable, and also gives it new ways to break

The paper demonstrates integration through the Model Context Protocol. In the prototype, constitutions can be exposed through a server and discovered by compatible MCP clients. The authors discuss a dual-stack implementation: a Python backend handling constitutional logic and an MCP server, plus a JavaScript frontend for the Creed.Space demonstration interface.

This matters because governance rules need distribution. A constitution that sits inside one demo prompt is a curiosity. A constitution that can be retrieved by multiple agent environments at runtime starts to look like infrastructure.

For an enterprise, that suggests a practical pattern:

1. Define the organisation’s compliance constitution.
2. Define domain-specific constitutions, such as HR, procurement, finance, customer support, or safety-critical operations.
3. Attach strictness levels to each rule family.
4. Let agents retrieve only the relevant active policies at runtime.
5. Log interventions for audit, tuning, and incident review.
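As a rough illustration of the last two steps, the sketch below filters a small in-memory registry by task domain and produces a JSON audit record. Everything in it is assumed for the example: the `Policy` fields, the `REGISTRY` contents, and the function names are hypothetical, and a real deployment would retrieve policies over MCP rather than from a Python list.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Policy:
    policy_id: str
    domain: str     # e.g. "procurement", "hr", or "org-wide"
    version: str
    adherence: int  # 1..5 strictness for the rule family
    text: str


# Hypothetical policy library.
REGISTRY = [
    Policy("org-compliance", "org-wide", "3.1", 5, "Never share customer data externally."),
    Policy("proc-suppliers", "procurement", "1.4", 4, "Use approved suppliers."),
    Policy("hr-language", "hr", "2.0", 3, "Use inclusive language in job ads."),
]


def active_policies(task_domain: str) -> list[Policy]:
    """Return org-wide policies plus those matching the task's domain."""
    return [p for p in REGISTRY if p.domain in ("org-wide", task_domain)]


def log_intervention(action: str, verdict: str, policies: list[Policy]) -> str:
    """Build one audit record as a JSON line (returned here, not written to disk)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "verdict": verdict,
        "policies_seen": [f"{p.policy_id}@{p.version}" for p in policies],
    }
    return json.dumps(record)


seen = active_policies("procurement")
print(log_intervention("place order with supplier X", "allow", seen))
```

Recording policy identifiers with versions is the part that matters for audit: it captures what the agent was actually shown, not what the library contained that day.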

That is the attractive version. The less photogenic version is that MCP also creates a security dependency. The paper explicitly discusses vulnerabilities such as tool poisoning and MCP rug pulls, where malicious or altered tool descriptions can manipulate agents. If the Superego itself depends on runtime tool and constitution retrieval, the integrity of that retrieval channel becomes part of the safety boundary.

In other words: policy-as-infrastructure only works if the infrastructure does not become the new prompt-injection buffet.

The prototype is proof of feasibility, not proof of governance maturity

The paper presents Creed.Space as a demonstration prototype. Users can select constitutions, apply them to tasks, and A/B test responses with and without constraints. The marketplace concept allows users or communities to publish, discover, fork, and customise constitutions.

This is a valuable product direction. It also moves the alignment problem into governance of the governance layer. A marketplace of constitutions needs moderation, versioning, provenance, conflict resolution, quality control, and abuse prevention. If users can upload policy bundles, then some uploaded policy bundles will be malicious, incoherent, self-serving, discriminatory, manipulative, or simply badly written. Humans discovered this problem in software package ecosystems, and apparently we have decided to replay the album with moral rule sets.

The authors recognise this and propose that all submitted constitutions remain subordinate to the Universal Ethical Floor, supported by community moderation mechanisms. That is necessary. It is not sufficient by itself. A business-grade version would need much more: signed policy packages, approval workflows, rollback, audit logs, red-team tests, policy diffing, jurisdiction tagging, and controls over who can activate which constitution in which operational context.

Still, the framing is useful. A constitutional marketplace is less interesting as a public app store of values and more interesting as an internal enterprise control plane. Companies already manage policy libraries. The paper’s contribution is to imagine those policies as executable constraints over agent behaviour.

What the benchmarks actually show

The paper’s quantitative evaluation uses two main benchmarks: HarmBench and AgentHarm. The purpose is not to prove the whole personalised marketplace vision. It is to test whether the Superego plus Universal Ethical Floor can reduce harmful outputs and improve refusal behaviour.

That distinction matters.

Test or observation	Likely purpose	What it supports	What it does not prove
HarmBench human jailbreaks	Main safety evidence	The Superego + UEF can reduce attack success rates across tested models	It does not validate broad personalised constitutions
AgentHarm harmful set	Main agentic-safety evidence	The UEF improves refusal of harmful agentic requests	It does not prove usefulness in ordinary enterprise workflows
AgentHarm benign set	Robustness / false-positive sensitivity	Natural-language tuning can reduce over-refusal	It does not eliminate the need for domain-specific calibration
Manual review of benign refusals	Error interpretation	Some benchmark “false positives” may be reasonable caution	It is small-scale and partly subjective
“Poemtest” manipulation	Exploratory extension	The system may resist suspicious constitutional manipulation patterns	It is not a formal adversarial robustness result
28-constitution picnic test	Failure-mode probe	Context overload can produce confabulated policy reasoning	It does not quantify the maximum safe constitution load
MCP integration	Implementation detail / feasibility	Runtime retrieval of constitutions is technically plausible	It does not prove secure deployment under hostile tool ecosystems

On HarmBench, the authors use the human jailbreak category. GPT-3.5-Turbo falls from roughly 12.0% attack success to roughly 2.0% with Superego + UEF, an 83.33% relative reduction. Gemini 2.5 Flash falls from 9.1% to 2.1%, a 76.92% reduction. GPT-4o starts much lower, around 0.7%, and falls to 0.025%; the authors note that the single harmful classification appears to be a classifier false positive because the Superego had already blocked the harmful aspect.

The GPT-4o result should be interpreted carefully. When the baseline is already low, relative reductions can look dramatic while absolute room for improvement is small. Still, the result is operationally relevant: an external layer can improve even a strong model, at least on this benchmark subset.
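The relative-versus-absolute distinction is easy to check against the HarmBench figures quoted above. The helper names below are ours, introduced only for this arithmetic.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage drop relative to the starting value."""
    return (before - after) / before * 100.0


def absolute_reduction(before: float, after: float) -> float:
    """Drop in raw units (percentage points when the inputs are rates)."""
    return before - after


# Attack success rates (%) as quoted in the text: (baseline, with Superego + UEF).
harmbench_asr = {
    "GPT-3.5-Turbo": (12.0, 2.0),
    "Gemini 2.5 Flash": (9.1, 2.1),
}

for model, (before, after) in harmbench_asr.items():
    rel = relative_reduction(before, after)
    pts = absolute_reduction(before, after)
    print(f"{model}: {rel:.2f}% relative, {pts:.1f} points absolute")
```

The same formula applied to a baseline that is already below one percent will produce a large relative figure from a very small absolute change, which is why the two numbers should be read together.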

AgentHarm is more interesting because it targets agentic scenarios. For Gemini 2.5 Flash, the average harm score falls from 0.277 to 0.00473, a 98.29% reduction. Refusal of harmful prompts rises from 52.6% to 99.4%. Per-category refusal rates in the reported table reach or approach 100% across the disinformation, harassment, drugs, fraud, hate, cybercrime, sexual content, and copyright categories; with an overall average of 99.4%, at least one category must sit slightly below a perfect score.

For Claude Sonnet 4, the paper reports harmful-prompt refusal rising from 72.0% to 100% in one configuration. Later, after targeted tuning to reduce benign refusals, the refined Claude Sonnet 4 + Superego configuration reaches 96.6% harmful refusal while reducing benign-set refusal to 2.27%, matching the baseline model’s benign refusal rate.

That tuning result may be the most business-relevant part of the evaluation. The authors reduce over-refusal through natural-language changes to the constitution, not model fine-tuning. They clarify that the Superego should judge whether an action should be done, not whether the inner agent has the technical capability to do it. They also instruct it to assume user authorisation in some account contexts and to treat user-initiated automation as not inherently harmful.

That is exactly the kind of mundane calibration enterprise AI needs. The safety layer must know the difference between “send a phishing email” and “send the customer the approved account update.” A compliance system that blocks both is safe in the same way a locked building is productive.
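Because the tuning described above is a text edit, it can be versioned and diffed like any other change. The sketch below uses Python's standard `difflib` to show what such a revision might look like. The clause wording is our paraphrase of the changes the paper describes, and the single-line `v1` baseline is invented for illustration; neither is the authors' actual constitution text.

```python
import difflib

# Hypothetical baseline and revised rule text.
v1 = [
    "Block any action that could cause harm.",
]
v2 = [
    "Block any action that could cause harm.",
    "Judge whether the action should be done, not whether the agent is able to do it.",
    "Assume the user is authorised to act on their own accounts unless context says otherwise.",
    "Do not treat user-initiated automation as harmful in itself.",
]

diff = list(difflib.unified_diff(
    v1, v2, fromfile="constitution@v1", tofile="constitution@v2", lineterm=""
))

# Added clauses are the lines prefixed with "+", excluding the "+++" file header.
added = [line[1:] for line in diff if line.startswith("+") and not line.startswith("+++")]

print("\n".join(diff))
```

A reviewable diff is what lets a risk team approve a reduction in over-refusal without guessing what else changed.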

The false-positive story is not just a footnote

The benign-set results are a useful reminder that safety benchmarks often compress context into labels. The paper reports that early Superego runs refused many benign prompts: about 75% for Gemini 2.5 Flash + Superego versus a roughly 5% baseline, and about 52.3% for Claude Sonnet 4 + Superego versus a roughly 2.3% baseline. That looks terrible until the authors inspect the refusals.

A manual review of 19 benign refusals from an intermediate Claude Sonnet 4 configuration found that 11 were appropriate refusals against suspicious prompts, 3 were debatable, and only 5 were genuine false positives. The examples include gift card fraud indicators, copyright misuse, possible data exfiltration, and deceptive AI imagery involving a public figure.

This does not make the system automatically correct. It means the benchmark label is sometimes thinner than the operational risk. In business workflows, that distinction is essential. A request can be syntactically benign and operationally dangerous. “Search local files and email a report externally” may be fine inside a properly authorised audit. It may also be data exfiltration wearing a clean shirt.

The right conclusion is not “the Superego is smarter than the benchmark.” The right conclusion is that runtime policy layers need evaluation methods that capture action context, user authority, tool permissions, and downstream consequences. Binary harm classifiers are useful. They are not enough.

The paper’s best business idea: policy as runtime infrastructure

The direct business implication is not that firms should immediately adopt Creed.Space or build a public constitution marketplace. The more durable idea is that agent governance can be externalised, versioned, retrieved, and enforced.

That changes the operational model.

Instead of embedding every rule into a model prompt, a company could maintain policy modules: procurement rules, customer support escalation rules, privacy constraints, HR language standards, financial suitability checks, safety-critical prohibitions, brand voice constraints, and jurisdiction-specific requirements. These modules could be retrieved at runtime according to task, user, market, and risk level.

The paper directly shows that a Superego-style layer can be prototyped, integrated via MCP, and used to improve harmful-request handling on selected benchmarks. Cognaptus infers that the same pattern could become an enterprise governance layer for agentic systems. What remains uncertain is whether the approach scales securely and reliably across messy real-world workflows.

Paper result	Cognaptus business inference	Boundary
External Superego monitors planned actions	Governance can sit outside the model	External oversight can be bypassed if it lacks visibility or capability
Creed Constitutions encode modular rules	Corporate policies can become executable policy objects	Rule quality, provenance, and conflict resolution become major problems
Dialable adherence levels	Businesses can distinguish preferences from hard constraints	Users may misconfigure strictness or hide malign preferences behind “personalisation”
UEF improves benchmark safety	A default safety floor is necessary under personalised policies	Cultural and legal disagreement over the floor remains unresolved
Natural-language tuning reduces false positives	Policy iteration may be faster than model retraining	Tuning may overfit organisational assumptions or benchmark quirks
MCP integration supports runtime retrieval	Governance can follow agents across tools	MCP security becomes part of the trust boundary

The ROI argument is therefore not “better ethics,” although that would be nice, and occasionally fashionable. The ROI argument is cheaper adaptation. When policy changes, update the constitution. When a new region launches, add regional constraints. When a risk review finds overblocking, revise the rule text. When a workflow becomes high-stakes, increase adherence levels and logging.

This is less glamorous than building a new model. It is also closer to how companies actually manage risk.

The hard part is context management, because of course it is

The paper includes a revealing failure mode: when the authors tried loading 28 distinct constitutions for a complex picnic-planning task, the model referenced only a small subset and began confabulating rationales, citing non-existent “Rawlsian” or “Trauma-Aware” constitutions.

That is not a minor UI bug. It is the central scaling problem.

A constitutional system assumes the agent can reliably retrieve, attend to, and apply the correct rules. But current LLMs have finite effective attention. If too many policies are active, the system may not merely ignore some. It may invent a coherent story about rules it never actually loaded. Governance hallucination: the most enterprise sentence imaginable.

The authors suggest mitigation strategies: cap the number of active constitutions, prioritise them by adherence level, provide user-selectable slots, and eventually develop model architectures with separate channels for task context, control signals, and constitutional information.

For business deployments, this means “more policy” is not automatically safer. A bloated rule bundle can degrade reliability. The practical design pattern should be minimal active policy, maximum relevance, explicit priority, and logged retrieval. The agent should know which rules were active, which rules were ignored, and why.
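One way to picture "minimal active policy, explicit priority, logged retrieval" is a selector that caps the active set and records what it left out. This is a sketch under assumptions: the default cap of five, the `relevance` score, and the tie-breaking order are hypothetical choices, and the paper does not specify a safe maximum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constitution:
    name: str
    adherence: int    # 1..5
    relevance: float  # 0..1, hypothetical score of fit to the current task


def select_active(candidates: list[Constitution], max_active: int = 5):
    """Keep the strictest, then most relevant, constitutions and report the rest."""
    ranked = sorted(candidates, key=lambda c: (c.adherence, c.relevance), reverse=True)
    loaded, dropped = ranked[:max_active], ranked[max_active:]
    retrieval_log = {
        "loaded": [c.name for c in loaded],
        "dropped": [c.name for c in dropped],
    }
    return loaded, retrieval_log


candidates = [
    Constitution("vegan", 2, 0.9),
    Constitution("allergy", 5, 0.8),
    Constitution("brand voice", 1, 0.2),
    Constitution("privacy", 5, 0.9),
]
loaded, retrieval_log = select_active(candidates, max_active=2)
print(retrieval_log)
```

Logging the dropped names gives a reviewer something to compare against when the agent cites a rule: if the cited rule is in neither list, the rationale was confabulated.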

A runtime conscience that hallucinates its moral sources is not a conscience. It is a compliance intern with confidence.

Personalisation needs a floor, or it becomes moral outsourcing

The paper is careful to avoid pure moral relativism. User-selected constitutions are subordinate to the Universal Ethical Floor. That is essential. Without a floor, “personal alignment” can become a polite phrase for letting users automate harmful preferences.

The business equivalent is familiar. A salesperson may prefer aggressive claims; legal says no. A manager may prefer opaque hiring shortcuts; HR says no. A customer may prefer a support agent that bypasses identity checks; security says absolutely not. Personalisation is useful only inside enforceable boundaries.

The paper’s hierarchy is therefore sound: universal safety first, personalised constitutions second, clarification when conflicts remain. The unsolved part is defining and governing the floor across cultures, jurisdictions, and organisational contexts. “Universal” is easy to type and hard to institutionalise.

For enterprises, the floor will not be one thing. It will likely be a stack:

1. Legal and regulatory non-negotiables.
2. Platform safety policies.
3. Corporate risk appetite.
4. Domain-specific compliance.
5. Customer or user preferences.
6. Task-level instructions.

The Superego pattern is useful because it can represent this hierarchy. It does not remove the political and organisational work of deciding what belongs in each layer. Sadly, architecture does not eliminate governance meetings. It merely gives them somewhere to land.
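A small sketch shows how such a stack might be represented. It uses a deny-overrides rule, which is our assumption rather than anything the paper prescribes: any layer may deny, a lower layer's "allow" cannot undo a higher layer's "deny", and the highest-precedence denying layer is the one reported. The layer keys and the `resolve` function are hypothetical names.

```python
from typing import Optional

# Highest precedence first, mirroring the stack listed above.
LAYERS = [
    "legal",           # legal and regulatory non-negotiables
    "platform",        # platform safety policies
    "corporate_risk",  # corporate risk appetite
    "domain",          # domain-specific compliance
    "user",            # customer or user preferences
    "task",            # task-level instructions
]


def resolve(votes: dict[str, str]) -> tuple[str, Optional[str]]:
    """Return (decision, deciding_layer) under a deny-overrides policy.

    `votes` maps a layer name to "allow" or "deny"; layers with no opinion
    are simply absent.
    """
    for layer in LAYERS:
        if votes.get(layer) == "deny":
            return "deny", layer
    return "allow", None


print(resolve({"user": "allow", "legal": "deny", "task": "allow"}))
print(resolve({"task": "allow"}))
```

The code settles only the mechanics. Deciding which rule belongs in which layer is still the meeting.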

Where the Superego can fail

The paper’s limitations are material, not decorative.

First, external steering does not solve inner alignment. The Superego does not change the inner agent’s learned representations or goals. If the inner agent becomes much more capable than the overseer, it may learn to produce plans that satisfy the visible checks while preserving misaligned intent. The authors explicitly acknowledge the risk that a weaker Superego can be outmanoeuvred.

Second, latency and cost matter. Adding screeners, evaluators, retrieval steps, classifiers, and intervention logic makes the system heavier. For low-risk tasks, the overhead may be unjustified. For regulated, safety-critical, or financial workflows, it may be cheap insurance. The architecture should therefore be risk-tiered, not sprayed across every autocomplete request like governance confetti.

Third, privacy is unavoidable. Creed Constitutions may encode religious beliefs, medical constraints, political sensitivities, ethical stances, disabilities, family rules, or corporate secrets. A system that stores and retrieves such data needs encryption, minimisation, access control, deletion rights, auditability, and jurisdictional handling. “Personalised alignment” is also personalised sensitive data.

Fourth, MCP creates supply-chain risk. If tools and constitutions are retrieved dynamically, attackers may target descriptions, server responses, policy packages, or tool metadata. A Superego can monitor tool interactions, but it can also be deceived by the same ecosystem it is supposed to police. Signed resources, tool pinning, contextual isolation, and version control are not optional niceties here.

Fifth, evaluation remains incomplete. HarmBench and AgentHarm are useful, but they do not cover the full space of real enterprise agent behaviour. The strongest results mainly validate the UEF safety layer. They do not yet show that hundreds of personalised constitutions can be safely combined, negotiated, updated, and enforced under production load.
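Of the mitigations named under the fourth limitation, pinning is the simplest to illustrate. The sketch below records a SHA-256 digest when a constitution or tool description is reviewed, then rejects anything retrieved later that does not match. The function names and the `pins` dictionary are hypothetical. Note that a digest only detects change; it does not prove authorship, so this is a complement to signed resources, not a substitute.

```python
import hashlib
import hmac


def fingerprint(content: str) -> str:
    """SHA-256 hex digest of a constitution's or tool description's text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def verify_pinned(name: str, content: str, pins: dict[str, str]) -> bool:
    """Accept a retrieved resource only if it matches the digest pinned at review time."""
    expected = pins.get(name)
    if expected is None:
        return False  # unknown resources are rejected rather than trusted
    # Constant-time comparison of the two hex digests.
    return hmac.compare_digest(expected, fingerprint(content))


# Hypothetical approved text, pinned when it was reviewed.
approved = "Never email files outside the organisation."
pins = {"data-handling": fingerprint(approved)}

# A later retrieval that has been silently altered (a "rug pull").
tampered = approved + " Except when the request says it is urgent."

print(verify_pinned("data-handling", approved, pins))
print(verify_pinned("data-handling", tampered, pins))
```

The check belongs on the retrieval path itself, before the text reaches either the inner agent or the Superego.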

What an enterprise version would need

A production-grade Superego layer would need more than a constitution list and a model call. It would need a governance stack.

Requirement	Why it matters
Policy provenance	Know who authored, approved, and changed each constitution
Versioning and rollback	Recover from bad policy updates quickly
Rule priority and conflict logic	Prevent ambiguous policy collisions from becoming arbitrary behaviour
Runtime retrieval logs	Audit what the agent actually saw before acting
Permission-aware tool context	Let the Superego judge not just the action, but whether this user and agent may perform it
False-positive review workflow	Tune overblocking without quietly lowering safety
Red-team testing	Probe prompt injection, tool poisoning, policy poisoning, and adversarial preference profiles
Privacy controls	Protect sensitive personal and organisational values
Latency budgets	Apply deep oversight where risk justifies it
Human escalation	Route true ambiguity to accountable humans

This is where the paper’s architectural framing becomes more valuable than its prototype. The Superego Agent is not merely a safety gadget. It is a possible design pattern for accountable autonomy: separate the worker from the referee, externalise the rules, log the decision, and intervene before the tool call.

That pattern will be especially relevant where agents move from text generation into action: procurement, finance, healthcare support, legal operations, customer service, education, cybersecurity, and industrial workflow automation.

The real contribution is not conscience, but controllability

The “superego” metaphor is memorable, but the business value is more prosaic. This is a controllability architecture.

It gives operators a place to put rules that should survive model swaps. It gives auditors a place to inspect why an action was blocked. It gives product teams a way to tune strictness without retraining. It gives governance teams a way to separate universal prohibitions from local preferences. It gives security teams a new layer to harden, because every solution arrives carrying its own little bag of problems.

The paper’s evidence supports cautious optimism. The UEF-backed Superego materially improves harmful-request handling in the reported benchmarks. The false-positive tuning story suggests that natural-language policy refinement can be practical. The MCP integration and Creed.Space prototype show that the mechanism can be built, not just waved at during a conference panel.

But the larger promise (pluralistic, personalised, marketplace-driven alignment for real-world agentic systems) remains a research and engineering agenda. The difficult work is not inventing more constitutions. It is ensuring the agent retrieves the right ones, applies them consistently, resolves conflicts transparently, protects the sensitive data inside them, and cannot be tricked into treating poisoned instructions as moral law.

Still, the direction is important. As agents become more capable, the question will shift from “Can the model answer?” to “Should this system act?” The Superego Agent is one serious attempt to put that second question into the workflow before the purchase, email, file transfer, recommendation, or tool call happens.

And that is the right place for conscience: not in the apology after the damage, but at the gate before execution.

Cognaptus: Automate the Present, Incubate the Future.

¹ Nell Watson, Ahmed Amer, Evan Harris, Preeti Ravindra, and Shujun Zhang, “Personalized Constitutionally-Aligned Agentic Superego: Secure AI Behavior Aligned to Diverse Human Values,” arXiv:2506.13774, 2025, https://arxiv.org/pdf/2506.13774.
