Opening — Why this matters now
AI safety is drifting toward an uncomfortable paradox. The more capable large language models become, the less transparent their internal decision-making appears — and the more brittle our existing safeguards feel. Text-based moderation catches what models say, not what they are doing. Activation-based safety promised to fix this, but in practice it has inherited many of the same flaws: coarse labels, opaque triggers, and painful retraining cycles.
The paper “GAVEL: Towards Rule-Based Safety Through Activation Monitoring” (ICLR 2026) argues that we are missing a conceptual layer. Not a better classifier. Not a bigger dataset. But a language of governance for model internals.
Background — From detectors to doctrines
Activation monitoring emerged as a response to representation attacks: jailbreaks, paraphrasing, and goal hijacking that evade surface-text filters. By watching internal activations, safety systems can detect when a model reasons about threats, scams, or manipulation — even if the final text looks benign.
But the dominant approach has been blunt. Detectors are trained on broad misuse datasets (“phishing”, “hate speech”, “cybercrime”), producing three systemic problems:
- Low precision — benign discussions trigger alarms because the detector learns correlations, not intent.
- Low flexibility — every new policy requires new data and retraining.
- Low interpretability — when a detector fires, nobody can explain why.
GAVEL’s authors borrow an analogy from cybersecurity: intrusion detection systems didn’t scale because they had better neural nets. They scaled because they had rules.
Analysis — What the paper actually does
Cognitive Elements: turning activations into primitives
GAVEL introduces Cognitive Elements (CEs) — mid-level, interpretable units of model behavior. Think less “this is a scam” and more “the model is issuing a threat”, “requesting payment tools”, or “building trust through role-play.”
Each CE is trained independently using excitation datasets: short text snippets designed to reliably elicit that specific cognitive behavior. Activations are collected at the token level, focusing on attention outputs in mid-to-late transformer layers (empirically the most semantically rich).
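To make this concrete, here is a minimal sketch of what token-level activation capture for CE training might look like, assuming a Hugging Face transformers decoder and forward hooks on the self-attention modules. The model name, layer indices, and excitation snippet are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of token-level activation capture for CE training.
# Model choice, layer indices, and the excitation snippet are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"        # any decoder-only model with the same layout
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

MID_LATE_LAYERS = [20, 24, 28]                  # assumed "semantically rich" layers
captured = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # output[0] is the attention output, shape (batch, seq_len, hidden)
        captured[layer_idx] = output[0].detach()
    return hook

handles = [
    model.model.layers[i].self_attn.register_forward_hook(make_hook(i))
    for i in MID_LATE_LAYERS
]

def token_activations(snippet: str) -> torch.Tensor:
    """Return per-token features: attention outputs concatenated across layers."""
    ids = tok(snippet, return_tensors="pt")
    with torch.no_grad():
        model(**ids)
    feats = torch.cat([captured[i] for i in MID_LATE_LAYERS], dim=-1)
    return feats.squeeze(0)                     # (seq_len, hidden * n_layers)

# Excitation snippet for a hypothetical "Threaten" CE
X = token_activations("Pay now or I will leak your private photos.")
```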
The result is a modular vocabulary of model behaviors that can co-occur, overlap, and recombine — without ever needing a combinatorial dataset.
Rules, not retraining
Once CEs are detectable, GAVEL does something deceptively simple: it applies Boolean logic.
Policies are written as rules such as:
Stop if Threaten ∧ PaymentTools ∧ MasqueradeAsHuman
Rules are evaluated over a temporal window of tokens, allowing behaviors to accumulate across a conversation — critical for scam automation and long-horizon manipulation.
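Here is a minimal sketch of how such a rule might be evaluated over a rolling token window, assuming an upstream classifier that emits per-token CE probabilities. The CE names come from the example rule above; the threshold, window size, and "any token above threshold" semantics are illustrative assumptions rather than the paper's rule language.

```python
# Sketch of Boolean rule evaluation over a sliding window of per-token CE
# probabilities. Threshold, window size, and firing semantics are assumptions.
from collections import deque

class RuleEngine:
    def __init__(self, window: int = 128, threshold: float = 0.8):
        self.window = {ce: deque(maxlen=window)
                       for ce in ("Threaten", "PaymentTools", "MasqueradeAsHuman")}
        self.threshold = threshold

    def observe(self, ce_probs: dict[str, float]) -> None:
        """Append one token's CE probabilities to the rolling window."""
        for ce, buf in self.window.items():
            buf.append(ce_probs.get(ce, 0.0))

    def fired(self, ce: str) -> bool:
        """A CE counts as present if any token in the window exceeds the threshold."""
        return any(p >= self.threshold for p in self.window[ce])

    def violates(self) -> bool:
        # Stop if Threaten AND PaymentTools AND MasqueradeAsHuman
        return (self.fired("Threaten") and self.fired("PaymentTools")
                and self.fired("MasqueradeAsHuman"))

# Toy stream standing in for the per-token output of a CE classifier.
stream = [
    {"Threaten": 0.95, "PaymentTools": 0.10, "MasqueradeAsHuman": 0.20},
    {"Threaten": 0.30, "PaymentTools": 0.90, "MasqueradeAsHuman": 0.10},
    {"Threaten": 0.20, "PaymentTools": 0.40, "MasqueradeAsHuman": 0.92},
]

engine = RuleEngine()
for probs in stream:
    engine.observe(probs)
    if engine.violates():
        print("Rule fired: stop generation")
        break
```

Because the window spans many tokens, the three behaviors do not need to co-occur in a single utterance; they only need to accumulate within the same stretch of conversation.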
Crucially:
- Rules are human-readable
- Rules are auditable
- Rules can be shared, versioned, and reused
This is governance logic, not just safety heuristics.
Architecture in practice
At runtime, GAVEL adds a lightweight multi-label classifier that maps each token’s activations to CE probabilities. These signals are aggregated, evaluated against rule predicates, and used to trigger enforcement actions (alert, refuse, stop, or steer).
The reported overhead is negligible: a latency increase of under 1% during real-time inference.
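A minimal sketch of what such a probe head could look like, assuming a single linear layer with independent sigmoid outputs over concatenated layer activations (matching the capture sketch above). The paper may use a different head architecture; the feature dimension and CE set here are assumptions.

```python
# Sketch of a lightweight multi-label CE probe: per-token activations in,
# independent CE probabilities out. The linear head and dimensions are assumptions.
import torch
import torch.nn as nn

CE_NAMES = ["Threaten", "PaymentTools", "MasqueradeAsHuman"]

class CEProbe(nn.Module):
    def __init__(self, feat_dim: int, n_ces: int = len(CE_NAMES)):
        super().__init__()
        self.linear = nn.Linear(feat_dim, n_ces)

    def forward(self, token_feats: torch.Tensor) -> torch.Tensor:
        # token_feats: (seq_len, feat_dim) -> (seq_len, n_ces) per-CE probabilities
        return torch.sigmoid(self.linear(token_feats))

# Training would use binary cross-entropy against per-CE labels from excitation data.
probe = CEProbe(feat_dim=4096 * 3)          # 3 concatenated 4096-dim layers, as above
loss_fn = nn.BCELoss()

# At inference, probe outputs feed the rule engine token by token.
feats = torch.randn(10, 4096 * 3)           # stand-in for real captured activations
probs = probe(feats)                        # (10, 3)
ce_probs_per_token = [dict(zip(CE_NAMES, row.tolist())) for row in probs]
```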
Findings — What actually works (and what breaks)
Precision beats brute force
Across nine misuse scenarios (phishing, SQL injection, election interference, romance scams, etc.), GAVEL consistently achieves:
- ROC-AUC ≈ 0.99
- Near-zero false positives, even against carefully constructed benign lookalikes
By contrast, activation classifiers trained on whole misuse categories exhibit classic overgeneralization. They detect themes. GAVEL detects intent structures.
Robust to adversarial obfuscation
Text-based moderation collapses under misdirection and goal hijacking. GAVEL does not.
Because Cognitive Elements track internal reasoning, adversarial padding and paraphrasing fail to suppress detection. In multiple categories, GPT-4-based judges dropped below 20% recall under misdirection attacks; GAVEL stayed above 85%.
Model- and language-agnostic
CE datasets are text-only. The same ruleset transfers across:
- Mistral-7B
- LLaMA-3
- Qwen-8B
- Gemma-4B
Even more impressive, if less heralded: CE detectors trained in English generalized cleanly to Spanish and Mandarin, suggesting that these primitives capture abstract cognition rather than surface language.
Automation closes the loop
The authors also demonstrate an agentic pipeline that can:
- Take a natural-language policy description
- Propose new CEs
- Generate excitation datasets
- Synthesize rules
In other words: governance itself becomes partially automatable.
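As a rough illustration only, the loop might be orchestrated as below. Here `call_llm` is a hypothetical stand-in for whatever model the authors use, and the prompts, data formats, and helper names are assumptions, not the paper's pipeline.

```python
# Illustrative orchestration only: `call_llm` is a hypothetical helper standing in
# for an LLM call; prompts and return formats are assumptions.
def call_llm(prompt: str) -> str:
    """Stub for a chat-completion call to whichever model drives the pipeline."""
    raise NotImplementedError

def policy_to_ruleset(policy_text: str) -> dict:
    # 1. Propose Cognitive Elements implied by the natural-language policy.
    ces = call_llm(f"List the atomic behaviors (CEs) needed to enforce: {policy_text}")
    # 2. Generate excitation snippets for each proposed CE.
    excitation = {ce: call_llm(f"Write short snippets that reliably elicit: {ce}")
                  for ce in ces.splitlines()}
    # 3. Synthesize Boolean rules over the proposed CEs.
    rules = call_llm(f"Combine these CEs into Boolean safety rules: {list(excitation)}")
    return {"ces": excitation, "rules": rules}
```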
Implications — Why this changes the conversation
GAVEL reframes activation safety from a model-centric problem into a policy-centric one.
For regulators, this matters because:
- Rules are inspectable
- Violations are explainable
- Enforcement logic is explicit
For enterprises, it matters because:
- Policies can be updated without retraining
- Domain-specific constraints are finally practical
- False positives stop eroding trust
And for the research community, GAVEL suggests a future where AI safety evolves like cybersecurity did: through shared vocabularies, rule repositories, and incremental refinement.
The trade-off is philosophical. Boolean rules feel rigid in a probabilistic world. But for high-stakes governance, ambiguity is not a virtue — it is a liability.
Conclusion — From alignment to accountability
GAVEL does not replace alignment, RLHF, or moderation APIs. It complements them with something they lack: explicit guarantees.
By decomposing cognition into modular elements and reintroducing rules as first-class citizens, the paper offers a credible bridge between interpretability research and enforceable AI governance.
Not a silver bullet. But finally, a gavel that can strike with reasons.
Cognaptus: Automate the Present, Incubate the Future.