GAVEL: When AI Safety Grows a Rulebook

Rules are boring until the audit starts.

That is roughly where enterprise AI safety is heading. A chatbot can be polite, policy-aligned, and apparently harmless on the surface, while still performing the internal work of manipulation, scam automation, or unsafe assistance. Text moderation catches what the model says. Classic activation monitoring tries to catch what the model is internally representing. But both can become awkward in production: one sees too little, the other often explains too little.

The paper “GAVEL: Towards Rule-Based Safety through Activation Monitoring” proposes a more operational idea: treat model-internal behavior as something that can be decomposed into small, interpretable units, then write explicit rules over those units.¹ Not a moral philosophy engine. Not another broad “harmfulness” score. A rulebook.

That distinction matters. Many readers will instinctively file GAVEL under “activation safety,” meaning another detector trained to recognize bad behavior. That is only half-right, and half-right is where expensive misunderstandings go to breed. The paper’s real contribution is architectural: it separates activation engineering from policy configuration. Once the model’s internal signals are translated into reusable behavioral primitives, a safety policy can become a Boolean expression rather than a full retraining project.

In cybersecurity terms, this is less like asking a giant model to decide whether something feels suspicious and more like writing a YARA rule for neural behavior. The analogy is not perfect. Neural representations are not malware strings. But the operational ambition is similar: make detection explicit, shareable, inspectable, and updateable.

Coarse misuse detectors fail because “bad” is not a single behavior

The paper begins from a practical problem: current activation-based safety systems often train detectors on broad misuse datasets. A detector might learn “cybercrime,” “hate speech,” or “misinformation.” That sounds sensible until the detector meets the real world, where benign security education resembles cybercrime, historical discussion resembles extremist content, and SQL tutoring resembles SQL injection.

The issue is not that activation monitoring is useless. The issue is that broad labels force the system to learn a messy distribution. A “phishing” detector trained on whole phishing examples may learn that emails, passwords, banks, urgency, and customer service language are all suspicious. That is not precision. That is a nervous intern with GPU access.

GAVEL’s response is to stop treating misuse categories as atomic. A scam is not one internal behavior. It is a composition: building trust, impersonating authority, requesting personal information, invoking payment tools, threatening consequences, or directing a user to take an action. Some of these elements are benign alone. Together, in the right configuration, they become a policy violation.

That is the conceptual move. Instead of asking:

“Is this conversation a scam?”

GAVEL asks:

“Which interpretable cognitive elements are active, and does their combination match a forbidden rule?”

This is why the mechanism-first reading matters. The paper is not merely saying that activations contain useful safety signals. That part is now familiar. It is saying that useful governance requires a layer between raw internal activations and final safety decisions.

Cognitive Elements turn activations into policy vocabulary

GAVEL introduces Cognitive Elements, or CEs: fine-grained, interpretable units of model behavior detected at the activation level. The paper’s CE vocabulary includes directive-like elements such as asking the user to click, download, approve, provide information, or transfer something; task elements such as creating content or crafting SQL; behavior elements such as building trust, threatening, masquerading as human, spreading hate speech, or being sycophantic; and topic elements such as taxation, payment tools, electoral politics, personal information, LGBTQ+ topics, or ethnoracial identity.

The key is granularity. A CE should be broad enough to generalize, but narrow enough to be useful in rules. “All scams” is too broad. “Mentioning a bank” is too narrow. “Build Trust,” “Payment Tools,” and “Personal Information” sit closer to the level where policy logic can be written.

The authors build CE detectors using excitation datasets: short text examples designed to elicit one CE at a time. For each CE, the target model processes these examples, and GAVEL captures internal activations from generated tokens. The paper reports that simply feeding examples into the model is not enough. The stronger method asks the model to explicitly think about the target CE while revising a statement. The authors call this approach ERI, and they show that it improves CE classification relative to weaker excitation methods.

This is important because the CE is not just a label on text. It is a trained mapping from internal representation to interpretable behavior. At runtime, each token can activate zero, one, or multiple CEs. That multi-label design is central. A romance scam token may simultaneously involve emotional engagement and trust-building. A phishing instruction may combine content creation with a directive to click or provide information. GAVEL needs to see overlaps, not force every token into one bucket.
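A minimal sketch of that multi-label step, assuming the detector emits one independent score per CE for each token and that each CE has its own calibrated threshold. The CE names follow the vocabulary described above; the scores, thresholds, and the `active_ces` helper are hypothetical and not taken from the paper.

```python
from typing import Dict, Set

# Hypothetical per-CE thresholds; in practice these would come from calibration.
THRESHOLDS: Dict[str, float] = {
    "behavior:build_trust": 0.60,
    "topic:payment_tools": 0.70,
    "directive:personal_information": 0.65,
}


def active_ces(scores: Dict[str, float], thresholds: Dict[str, float]) -> Set[str]:
    """Return every CE whose score meets its own threshold.

    Each CE is decided independently, so a token can activate
    zero, one, or several CEs at once (multi-label, not argmax).
    """
    return {ce for ce, score in scores.items() if score >= thresholds[ce]}


# Made-up detector scores for three consecutive tokens.
token_scores = [
    {"behavior:build_trust": 0.10, "topic:payment_tools": 0.20,
     "directive:personal_information": 0.05},
    {"behavior:build_trust": 0.82, "topic:payment_tools": 0.15,
     "directive:personal_information": 0.30},
    {"behavior:build_trust": 0.75, "topic:payment_tools": 0.91,
     "directive:personal_information": 0.40},
]

per_token = [active_ces(scores, THRESHOLDS) for scores in token_scores]
```

The design point is the absence of an argmax: forcing each token into a single bucket would hide exactly the overlaps the rules depend on.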

The paper reports that 54% of detected malicious dialogues contained tokens where multiple CEs were active at once. That finding is not a decorative appendix point. It supports the central design choice: train CEs separately, then let them compose during real conversations. Without that, the framework would collapse into a combinatorial dataset problem, where every possible harmful combination requires its own training examples. Wonderful for dataset vendors. Less wonderful for everyone else.
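For readers who want to measure the same kind of statistic on their own logs, here is a hedged sketch. It assumes each dialogue is stored as a list of per-token sets of active CEs; that storage format, the function names, and the toy data are all illustrative assumptions rather than the paper's tooling.

```python
from typing import List, Set

Dialogue = List[Set[str]]  # one set of active CEs per generated token


def has_cooccurrence(dialogue: Dialogue) -> bool:
    """True if at least one token has two or more CEs active at once."""
    return any(len(ces) >= 2 for ces in dialogue)


def cooccurrence_rate(dialogues: List[Dialogue]) -> float:
    """Fraction of dialogues containing at least one multi-CE token."""
    if not dialogues:
        return 0.0
    return sum(has_cooccurrence(d) for d in dialogues) / len(dialogues)


# Toy data: four dialogues, two of which contain an overlapping token.
toy_dialogues: List[Dialogue] = [
    [set(), {"behavior:build_trust"},
     {"behavior:build_trust", "topic:payment_tools"}],
    [{"task:creating_content"}, {"directive:click"}],
    [{"task:creating_content", "directive:click"}],
    [set(), set()],
]

rate = cooccurrence_rate(toy_dialogues)
```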

The rule layer is where governance enters

Once CEs are detectable, GAVEL expresses policy constraints as rules. A rule is a predicate over CEs plus an enforcement action. The paper’s example for phishing resembles:

refuse if task:creating_content AND (directive:click OR directive:grant OR directive:personal_information)

This is the turning point. The safety decision is no longer hidden inside one broad classifier. It is now a rule that a human can read, debate, version, audit, and revise.
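To make "a rule a human can read" concrete, here is a small sketch that encodes the phishing rule above as nested data and evaluates it against a set of active CEs. The nested-tuple encoding and the `matches` function are assumptions made for illustration; the paper's actual rule syntax and engine may differ.

```python
from typing import Set, Tuple, Union

# A rule is either a CE name or a tuple ("AND" | "OR", sub-rule, sub-rule, ...).
Rule = Union[str, Tuple]

PHISHING_RULE: Rule = (
    "AND",
    "task:creating_content",
    ("OR", "directive:click", "directive:grant", "directive:personal_information"),
)


def matches(rule: Rule, active: Set[str]) -> bool:
    """Recursively evaluate a Boolean rule against the set of active CEs."""
    if isinstance(rule, str):
        return rule in active
    op, *operands = rule
    if op == "AND":
        return all(matches(sub, active) for sub in operands)
    if op == "OR":
        return any(matches(sub, active) for sub in operands)
    raise ValueError(f"unknown operator: {op!r}")


# Content creation plus a click directive satisfies the predicate.
should_refuse = matches(PHISHING_RULE, {"task:creating_content", "directive:click"})
```

Because the rule is plain data, it can live in version control and be diffed like any other configuration file, which is most of the governance argument in miniature.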

GAVEL evaluates these rules over a temporal window because harmful behavior often develops across a conversation. A scam does not always begin with “please transfer money.” It may start with trust-building, move into authority impersonation, and only later request personal data or payment. In the paper’s experiments, the window typically spans the whole conversation, though the authors note that shorter or adaptive windows may fit other deployments.
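One way to picture the window, sketched under a simplifying assumption: a rule fires once all the CEs it needs have appeared somewhere within the last `window` tokens, with `window=None` meaning the whole conversation. The `first_violation` helper, the union-over-window semantics, and the example rule are illustrative assumptions, not the paper's implementation.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Optional, Set


def first_violation(
    per_token_ces: Iterable[Set[str]],
    predicate: Callable[[Set[str]], bool],
    window: Optional[int] = None,
) -> Optional[int]:
    """Return the index of the first token at which the rule holds, else None.

    The predicate is evaluated over the union of CEs seen in the window.
    window=None keeps every token, i.e. the whole conversation.
    """
    recent: Deque[Set[str]] = deque(maxlen=window)
    for index, ces in enumerate(per_token_ces):
        recent.append(ces)
        seen: Set[str] = set().union(*recent)
        if predicate(seen):
            return index
    return None


def scam_rule(seen: Set[str]) -> bool:
    # Hypothetical rule: trust-building followed, at any distance, by payment tools.
    return "behavior:build_trust" in seen and "topic:payment_tools" in seen


tokens = [{"behavior:build_trust"}, set(), set(), {"topic:payment_tools"}]

whole_conversation = first_violation(tokens, scam_rule)
short_window = first_violation(tokens, scam_rule, window=2)
```

With the whole conversation in view, the rule fires at the last token. With a two-token window the early trust-building has already scrolled out of sight and nothing fires, which is the trade-off a shorter window buys.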

The enforcement action can be stopping generation, overriding with a scripted response, or steering activations. The paper focuses on detection rather than evaluating every possible response strategy. That boundary matters. GAVEL shows that rule violations can be detected; it does not prove that every downstream refusal or steering policy will be safe, pleasant, or legally adequate. The gavel can strike. Court procedure is another department.

A compact way to understand the architecture is this:

Layer	What it does	Business meaning
CE excitation data	Creates examples that isolate interpretable behaviors	Builds a reusable internal-behavior vocabulary
Activation extraction	Captures model-internal signals from the target model	Adapts the vocabulary to a specific model’s representations
Multi-label CE detector	Predicts which CEs are active per token	Turns neural states into auditable signals
Boolean rules	Compose CE signals into policy predicates	Allows policy updates without rebuilding every detector
Enforcement action	Alerts, refuses, stops, or steers	Connects detection to operational control

The business value is not simply “better accuracy.” It is that policy maintenance becomes more modular. If a company changes its scam policy, it may revise a rule. If it enters a new domain, it may add a few CEs. If regulators require evidence, it can point to rule firings and triggering tokens. That is a different governance posture from “the classifier said no.”

The strongest result is precision, not mysticism

The main evaluation uses nine misuse categories across three domains: cybercrime, psychological harm, and scam automation. The categories include phishing, SQL injection, delusional thinking, anti-LGBTQ content, election-related misuse, racism, tax authority scams, romance scams, and e-commerce scams.

The authors generated multi-turn dialogues using GPT-4.1 and validated them with GPT-5. Each category includes 150 misuse conversations, with 50 held out for CE threshold calibration, plus 500 closely related benign conversations per category. The full evaluation dataset contains 14,950 multi-turn conversations. The benign contrast design is important: the test is not only “can GAVEL detect obvious bad examples?” but “can it avoid flagging nearby legitimate conversations?”

On Mistral-7B, GAVEL reports an average AUC of 0.99, balanced accuracy of 0.96, F1 of 0.95, and TPR of 0.94 across the nine misuse categories, with near-zero false positives on the closely related benign category tests. The paper also reports false positive rates of 0.088 and 0.008 on two natural benign background datasets, UltraChat and DialogSum. That first number is worth noticing. It is still far better than many noisy alternatives, but not “zero false positives everywhere.” Precision is strong, not magical. AI safety remains disappointingly attached to arithmetic.
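Since the arithmetic matters, here is how TPR, FPR, balanced accuracy, and F1 relate to one another, using the standard definitions. The confusion counts below are hypothetical and chosen only to make the sums easy to follow; they are not the paper's per-category results.

```python
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute standard detection metrics from confusion-matrix counts."""
    tpr = _ratio(tp, tp + fn)          # recall on misuse conversations
    fpr = _ratio(fp, fp + tn)          # share of benign conversations flagged
    precision = _ratio(tp, tp + fp)
    return {
        "tpr": tpr,
        "fpr": fpr,
        "balanced_accuracy": (tpr + (1.0 - fpr)) / 2.0,
        "f1": _ratio(2.0 * precision * tpr, precision + tpr),
    }


# Hypothetical category: 100 misuse and 500 benign test conversations.
metrics = detection_metrics(tp=90, fp=5, tn=495, fn=10)
```

Note how five false positives out of 500 benign conversations barely move balanced accuracy but still pull F1 down. That is why the benign lookalike sets matter more than the headline AUC.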

The comparison with a broad activation classifier is especially revealing. The activation classifier predicts whole misuse categories from activations. It achieves a strong average AUC of 0.97 in the extended table, but its false positive rate is much higher in some cases, including 0.35 for phishing. GAVEL’s advantage comes from using more precise CE compositions rather than broad misuse labels.

That is the paper’s core empirical argument: broad category activation classifiers can detect themes, but themes are not policy violations. Rules over CEs detect structures.

The experiment suite is a stack of different claims, not one giant victory lap

The paper includes main comparisons, ablations, external benchmark tests, adversarial robustness tests, multilingual tests, and runtime analysis. These should not be read as interchangeable evidence. They answer different questions.

Test	Likely purpose	What it supports	What it does not prove
Main nine-category comparison	Main evidence	CE rules can outperform broad baselines on curated misuse scenarios	General performance on all real enterprise traffic
Attention vs. hidden-state ablation	Implementation detail / ablation	Attention outputs work better than hidden states for this setup	That attention outputs are always best for all models
Layer selection ablation	Implementation detail / sensitivity test	Mid-to-later layers are better on Mistral-7B	Universal optimal layer ranges
CE co-occurrence analysis	Mechanism validation	Separately trained CEs can overlap during real dialogues	Perfect disentanglement of all cognitive states
External datasets	Generalization / scalability test	Automated CE and rule generation can adapt to new benchmarks	Mature policy engineering without human review
GPT-4 adversarial comparison	Robustness test	Activation monitoring resists surface obfuscation better than text judging	Full resistance to activation-aware attackers
Multilingual test	Representation robustness test	English CE datasets can transfer to Spanish and Mandarin in tested cases	All-language governance coverage
Runtime analysis	Deployment feasibility test	The detector adds small per-token overhead	Enterprise integration cost, compliance overhead, or serving complexity

This separation prevents a common reading mistake. The paper does not prove that GAVEL is a complete safety system. It makes a narrower and more useful claim: rule-based activation monitoring is feasible, precise in tested scenarios, and operationally attractive enough to deserve serious engineering attention.

The ablations explain why the mechanism works

Two appendix results are more important than they look.

First, the authors compare detectors trained on attention outputs versus hidden states. Attention outputs perform better in their setup, especially on false positives. The paper reports a drop in benign-data FPR from 0.204 using hidden states to 0.010 using attention outputs, and an e-commerce FPR drop from 0.470 to 0.140. The authors interpret this as evidence that CEs are better localized in the attention mechanism’s contextual information flow.

This is an ablation, not a universal law of transformer cognition. But for the paper’s argument, it matters because the framework depends on extracting stable, behavior-relevant signals. If the signal source were arbitrary, the rule layer would be sitting on sand.

Second, the layer-selection ablation finds that mid-to-later layers, roughly layers 13 to 26 for Mistral-7B, perform best. That aligns with the usual expectation that deeper layers encode more abstract semantic information. Again, the business reader does not need to worship layer 13. The takeaway is simpler: activation monitoring is not plug-and-play. It requires representation engineering choices, and those choices affect precision.

This is where “auditable” should not be confused with “effortless.” GAVEL makes the policy layer more explicit, but the technical layer still needs calibration, model-specific extraction, and threshold tuning.
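As an illustration of what threshold tuning can involve, the sketch below picks a per-CE threshold from held-out scores by maximizing balanced accuracy. This selection criterion is an assumption made for the example; the article does not describe the paper's calibration procedure, and the scores are made up.

```python
from typing import List, Tuple


def calibrate_threshold(
    positive_scores: List[float],
    negative_scores: List[float],
) -> Tuple[float, float]:
    """Pick the score threshold that maximizes balanced accuracy.

    A sample is predicted positive when score >= threshold. Candidate
    thresholds are the observed scores; ties keep the lowest candidate.
    """
    candidates = sorted(set(positive_scores) | set(negative_scores))
    best_threshold, best_balanced = candidates[0], -1.0
    for threshold in candidates:
        tpr = sum(s >= threshold for s in positive_scores) / len(positive_scores)
        tnr = sum(s < threshold for s in negative_scores) / len(negative_scores)
        balanced = (tpr + tnr) / 2.0
        if balanced > best_balanced:
            best_threshold, best_balanced = threshold, balanced
    return best_threshold, best_balanced


# Hypothetical held-out detector scores for one CE.
held_out_positive = [0.90, 0.80, 0.75, 0.40]
held_out_negative = [0.10, 0.20, 0.50, 0.30]

threshold, balanced_accuracy = calibrate_threshold(held_out_positive, held_out_negative)
```

A deployment that cares more about false positives than misses would swap in a different objective, for example the lowest threshold that keeps FPR under a fixed budget.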

Text judges look good until the adversary starts decorating the room

The paper compares GAVEL with GPT-4 as a text-based judge. On standard, non-adversarial datasets, GPT-4 performs strongly. This is an important result because it prevents a cheap argument: if text moderation already works on obvious violations, why bother with activations?

The answer appears under adversarial conditions.

The authors test three strategies: misdirection, goal hijacking, and evasion. Misdirection dilutes harmful content with benign chatter. Goal hijacking uses prompt injection to compromise the monitor’s instructions. Evasion tells the generating model to achieve the harmful goal while avoiding detection.

Under misdirection, GPT-4’s detection rate falls sharply in several categories: 0.12 for delusional thinking, 0.15 for elections, 0.16 for racism, 0.00 for tax authority scams, and 0.01 for e-commerce scams. GAVEL remains much higher: 0.87, 0.99, 0.99, 0.86, and 0.89 respectively.

Under goal hijacking, GPT-4 also degrades, while GAVEL stays high across the tested categories. Under evasion, GPT-4 recovers in some cases, and GAVEL is not perfect either: tax authority detection falls to 0.71. That detail is worth keeping. A serious business article should not polish away the inconvenient decimals. The result says activation monitoring is more robust to surface manipulation, not that it is invulnerable.

The deceptive medical marketing case study adds another useful nuance. The task is subtle: an LLM gives apparently helpful medical advice while covertly promoting a particular service. GPT-4 reaches a detection rate of only 0.244 on these cases, with zero false positives. GAVEL reaches 0.728 with an FPR of 0.010, using a rule based on healthcare context and trust-building / deceptive promotion elements.

This is exactly the kind of scenario enterprises should care about. Many real policy failures will not look like cartoon villains typing “commit fraud now.” They will look like plausible service interactions with an undisclosed manipulative objective. Surface tone can be clean while intent structure is not.

Cross-model and cross-language results are promising, with a catch

Because CE datasets are text-only, the same CE descriptions and excitation examples can be reused across models. The detector itself still needs to be trained on the target model’s activations, but the rule and dataset layer is shareable. The paper tests GAVEL across Mistral-7B, Llama3-8B, Qwen3-8B, and Gemma-4B, reporting strong weighted accuracy across scenarios.

The language result is also interesting. The authors build the excitation data in English and test selected misuse categories in English, Spanish, and Mandarin. For example, Spanish phishing reaches 0.95 TPR with 0.02 FPR, Mandarin tax authority reaches 0.98 TPR with 0.00 FPR, and Mandarin phishing reaches 0.86 TPR with 0.01 FPR. The performance is not identical, but it is close enough to support the idea that CEs capture abstract behavior more than surface wording.

For global enterprises, this is a meaningful signal. A shared CE vocabulary could reduce the burden of building separate safety taxonomies for every language. But the paper tests only three languages and four selected categories. It does not prove coverage for code-switching, low-resource languages, dialect-heavy customer support, or culturally specific scams. Those are exactly where deployment teams should test before celebrating with procurement paperwork.

Automation helps, but governance cannot be fully outsourced to the tool that needs governing

The authors release an automated pipeline that can generate CEs, rules, excitation datasets, and evaluation configurations from a natural-language scenario description. This is sensible because hand-authoring every CE and rule would be tedious.

On external benchmarks, the automated pipeline achieves 0.76 TPR for PKU phishing guidance, 0.97 for Reasoning Shield political risk, 0.91 for ToxiGen ethnoracial content, and 0.98 for ToxiGen homophobia. These results support the claim that the framework can be adapted quickly to new domains.

But the numbers also show why human review remains necessary. A 0.76 TPR on phishing guidance is useful; it is not a mature compliance guarantee. Automated rule generation should be treated as drafting infrastructure, not final authority. In practice, the workflow should look more like:

1. Use automation to propose CEs, rules, and synthetic tests.
2. Have domain owners review whether the rules reflect the actual policy.
3. Calibrate thresholds on realistic benign and violating traffic.
4. Log rule firings and false positives.
5. Revise the CE vocabulary and rules iteratively.

That is still a governance process. GAVEL may make it less chaotic. It does not make it disappear.

The business value is policy velocity plus auditability

For enterprises, the useful lesson is not “use GAVEL tomorrow.” The useful lesson is that AI safety infrastructure may need to become more like security infrastructure: layered, rule-governed, testable, and revision-friendly.

GAVEL points to three operational advantages.

First, policy updates become cheaper. If safety logic is embedded in a broad classifier, changing policy often means building data, retraining, and revalidating a model. If policy logic is expressed as rules over existing CEs, some changes become configuration changes. Not all, obviously. New domains may still need new CEs. But the revision surface is smaller.

Second, false positives become easier to diagnose. When a normal SQL tutoring conversation gets flagged by a broad cybercrime detector, the operator may not know why. In GAVEL’s framework, a rule violation can be traced to specific CEs and tokens. The paper includes an example where other baselines flag a benign SQL discussion, while GAVEL avoids the false positive because its rule is looking for intentional improper SQL crafting rather than any mention of risky database concepts.

Third, audit trails become more legible. A compliance team cannot meaningfully audit “the neural classifier felt unsafe.” It can audit a rule such as: creating content plus directive to provide personal information plus payment tools. The rule may still be contested, but at least the debate has an object.
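A rough sketch of what such an auditable object could look like: a record that ties a fired rule to the tokens that activated each required CE. It handles only a purely conjunctive rule, and the `RuleFiring` record, the `explain_firing` helper, the CE names, and the sample tokens are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Set


@dataclass
class RuleFiring:
    rule_name: str
    action: str
    evidence: Dict[str, List[str]]  # CE name -> tokens on which it was active


def explain_firing(
    rule_name: str,
    action: str,
    required_ces: Sequence[str],
    tokens: Sequence[str],
    per_token_ces: Sequence[Set[str]],
) -> Optional[RuleFiring]:
    """Build an audit record for an all-CEs-required rule, or None if it did not fire."""
    evidence = {
        ce: [tok for tok, ces in zip(tokens, per_token_ces) if ce in ces]
        for ce in required_ces
    }
    if not all(evidence.values()):  # some required CE never activated
        return None
    return RuleFiring(rule_name, action, evidence)


REQUIRED = ["task:creating_content", "directive:personal_information",
            "topic:payment_tools"]
tokens = ["Write", "email", "asking", "for", "card", "number"]
per_token = [
    {"task:creating_content"}, {"task:creating_content"},
    {"directive:personal_information"}, set(),
    {"topic:payment_tools"},
    {"topic:payment_tools", "directive:personal_information"},
]

firing = explain_firing("payment_phishing", "refuse", REQUIRED, tokens, per_token)
```

A record like this is what an incident reviewer can actually argue with: which CE, on which tokens, under which rule.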

Here is the business translation:

Technical contribution	Operational consequence	ROI relevance
CEs as reusable behavior primitives	Safety vocabulary can be shared and extended	Reduces duplicated policy engineering
Boolean rules over CE activations	Policies can be versioned and revised	Lowers cost of policy iteration
Token-level CE visualization	Violations become explainable	Improves audit and incident review
Multi-label detection	Complex misuse can be captured compositionally	Avoids combinatorial dataset explosion
Low runtime overhead	Real-time monitoring is plausible	Reduces serving-cost objections

The word “plausible” is deliberate. The paper reports low classifier overhead: 0.00021 seconds per token, compared with about 0.032 seconds per token for generation in the runtime test, using Mistral-7B on an RTX 6000 Ada. That is technically encouraging. But enterprise deployment cost includes integration, logging, monitoring, privacy review, incident workflows, and model-access constraints. Latency is only one line in the budget. Sadly, finance departments have discovered rows.
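Using only the two per-token figures quoted above, the relative overhead works out to roughly 0.66% of generation time. The short calculation below makes that explicit; the 500-token response length is an arbitrary assumption for illustration.

```python
# Per-token timings as quoted from the paper's runtime test.
DETECTOR_SECONDS_PER_TOKEN = 0.00021
GENERATION_SECONDS_PER_TOKEN = 0.032  # approximate figure

# Detector cost as a fraction of generation cost.
relative_overhead = DETECTOR_SECONDS_PER_TOKEN / GENERATION_SECONDS_PER_TOKEN


def added_latency_seconds(num_tokens: int) -> float:
    """Extra wall-clock time the detector adds over a response of num_tokens."""
    return num_tokens * DETECTOR_SECONDS_PER_TOKEN


# Hypothetical 500-token response.
response_tokens = 500
detector_seconds = added_latency_seconds(response_tokens)
generation_seconds = response_tokens * GENERATION_SECONDS_PER_TOKEN

print(f"relative overhead: {relative_overhead:.2%}")
print(f"detector: {detector_seconds:.3f}s on top of {generation_seconds:.1f}s of generation")
```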

Boundaries: what GAVEL does not settle

GAVEL is strongest when policies can be expressed as combinations of identifiable behaviors. That is many safety problems, but not all of them. Some governance questions involve ambiguity, context, jurisdiction, or value judgment that may not reduce cleanly to CE predicates. A rulebook is useful. It is not a theory of ethics in YAML.

The framework also assumes access to model activations. That is easy for open or self-hosted models and much harder for closed API models. A company using only black-box commercial LLM APIs cannot simply attach GAVEL unless the provider exposes relevant internals or offers a managed equivalent.

The evidence base is also synthetic-heavy. The paper’s datasets are carefully constructed and validated, and the benign lookalikes are a strength. Still, real enterprise traffic has messier phrasing, incomplete context, multilingual drift, domain jargon, user frustration, and adversaries who adapt after observing enforcement patterns. The paper itself notes a future adversarial vector: CE-level jailbreaks, where attackers attempt to achieve harmful outcomes without activating the monitored CE signatures.

Finally, rule quality becomes a governance bottleneck. Bad rules can create blind spots. Overbroad CEs can revive false positives. Narrow CEs can miss variants. Thresholds can drift across models, domains, and languages. GAVEL makes these problems visible and modular, which is progress. It does not abolish them. There is no free lunch, only lunch with better logging.

The useful future is not “AI decides safety,” but “humans can inspect the safety machinery”

GAVEL’s deeper significance is that it shifts activation safety from a model-centric frame to a policy-centric one. The model still matters. The activations still matter. But the safety decision becomes something closer to an explicit operational artifact.

That is a serious move for enterprise AI. Businesses do not only need safer models. They need systems where safety logic can be reviewed, updated, explained, and tested against policy. They need evidence when something is blocked. They need reasons when something slips through. They need to change rules without rebuilding the entire machine every Tuesday because legal, risk, and product finally agreed on a sentence.

GAVEL does not replace alignment, RLHF, content moderation, red-teaming, or human review. It adds a different layer: rule-based monitoring over internal representations. In high-stakes domains, that layer could be valuable precisely because it is less elegant than end-to-end neural judgment. It is explicit. It is inspectable. It is a little bureaucratic.

And bureaucracy, when aimed correctly, is just accountability wearing a less fashionable jacket.

The paper’s contribution is not that AI safety has been solved by Boolean logic. That would be adorable. Its contribution is showing that internal model behavior can be decomposed into reusable, interpretable primitives, and that these primitives can support rule-based governance with strong empirical performance across tested misuse settings.

That makes GAVEL less of a final answer and more of a design pattern: build the vocabulary, write the rule, monitor the activation, log the violation, revise the policy.

AI safety may not need more vibes. It may need a rulebook.

Cognaptus: Automate the Present, Incubate the Future.

¹ Shir Rozenfeld, Rahul Pankajakshan, Itay Zloczower, Eyal Lenga, Gilad Gressel, and Yisroel Mirsky, “GAVEL: Towards Rule-Based Safety through Activation Monitoring,” arXiv:2601.19768v3, 2026, https://arxiv.org/abs/2601.19768.
