Opening — Why this matters now

AI safety is drifting toward an uncomfortable paradox. The more capable large language models become, the less transparent their internal decision-making appears — and the more brittle our existing safeguards feel. Text-based moderation catches what models say, not what they are doing. Activation-based safety promised to fix this, but in practice it has inherited many of the same flaws: coarse labels, opaque triggers, and painful retraining cycles.

The paper “GAVEL: Towards Rule-Based Safety Through Activation Monitoring” (ICLR 2026) argues that we are missing a conceptual layer. Not a better classifier. Not a bigger dataset. But a language of governance for model internals.

Background — From detectors to doctrines

Activation monitoring emerged as a response to representation attacks: jailbreaks, paraphrasing, and goal hijacking that evade surface-text filters. By watching internal activations, safety systems can detect when a model reasons about threats, scams, or manipulation — even if the final text looks benign.

But the dominant approach has been blunt. Detectors are trained on broad misuse datasets (“phishing”, “hate speech”, “cybercrime”), producing three systemic problems:

  1. Low precision — benign discussions trigger alarms because the detector learns correlations, not intent.
  2. Low flexibility — every new policy requires new data and retraining.
  3. Low interpretability — when a detector fires, nobody can explain why.

GAVEL’s authors borrow an analogy from cybersecurity: intrusion detection systems didn’t scale because they had better neural nets. They scaled because they had rules.

Analysis — What the paper actually does

Cognitive Elements: turning activations into primitives

GAVEL introduces Cognitive Elements (CEs) — mid-level, interpretable units of model behavior. Think less “this is a scam” and more “the model is issuing a threat”, “requesting payment tools”, or “building trust through role-play.”

Each CE detector is trained independently on its own excitation dataset: short text snippets designed to reliably elicit that specific cognitive behavior. Activations are collected at the token level, focusing on attention outputs in mid-to-late transformer layers (empirically the most semantically rich).
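
The recipe can be pictured with standard tooling. Below is a minimal sketch of collecting token-level attention outputs and fitting a single CE probe; the model choice, layer index, hook placement, toy snippets, and logistic-regression probe are illustrative assumptions, not the authors' exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

# Assumed configuration for illustration only; not the paper's exact choices.
MODEL = "mistralai/Mistral-7B-v0.1"
LAYER = 20  # a mid-to-late layer, per the paper's observation

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

captured = []

def grab_attention_output(_module, _inputs, output):
    # output[0]: (batch, seq_len, hidden) attention sublayer output for every token
    captured.append(output[0].detach().float().cpu())

handle = model.model.layers[LAYER].self_attn.register_forward_hook(grab_attention_output)

def token_activations(text):
    """Return one activation vector per token of `text` at the hooked layer."""
    captured.clear()
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        model(**ids)
    return captured[0][0]  # (seq_len, hidden)

# Toy excitation snippets for one CE ("Threaten") plus benign contrasts.
positive = ["Pay now or I will leak your files.", "Do it, or there will be consequences."]
negative = ["Here is the quarterly report you asked for.", "Thanks, see you tomorrow."]

X, y = [], []
for text, label in [(t, 1) for t in positive] + [(t, 0) for t in negative]:
    acts = token_activations(text)
    X.extend(acts.numpy())            # one training row per token
    y.extend([label] * acts.shape[0])

probe = LogisticRegression(max_iter=1000).fit(X, y)  # per-token detector for this CE
handle.remove()
```

Each additional CE gets its own probe over the same activations, which is what keeps the vocabulary modular.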

The result is a modular vocabulary of model behaviors that can co-occur, overlap, and recombine — without ever needing a combinatorial dataset.

Rules, not retraining

Once CEs are detectable, GAVEL does something deceptively simple: it applies Boolean logic.

Policies are written as rules such as:

Stop if Threaten ∧ PaymentTools ∧ MasqueradeAsHuman

Rules are evaluated over a temporal window of tokens, allowing behaviors to accumulate across a conversation — critical for scam automation and long-horizon manipulation.
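
The rule layer itself needs very little machinery. Here is a minimal sketch of how a conjunction like the one above could be evaluated over a sliding window of per-token CE probabilities; the window size, threshold, and the numbers in the toy stream are assumptions, not values from the paper.

```python
from collections import deque

# Illustrative parameters; the paper does not prescribe these exact values.
WINDOW = 256      # number of recent tokens the rule looks back over
THRESHOLD = 0.8   # per-token probability needed to count a CE as present

class RuleMonitor:
    """Tracks which Cognitive Elements fired within a sliding token window."""

    def __init__(self, rule):
        self.rule = rule                      # callable: set[str] -> bool
        self.history = deque(maxlen=WINDOW)   # one set of fired CE names per token

    def observe(self, ce_probs):
        """ce_probs: dict of CE name -> probability for the current token."""
        self.history.append({name for name, p in ce_probs.items() if p >= THRESHOLD})
        active = set().union(*self.history)   # CEs seen anywhere in the window
        return self.rule(active)              # True means the policy rule is violated

def scam_rule(active):
    """Stop if Threaten AND PaymentTools AND MasqueradeAsHuman."""
    return {"Threaten", "PaymentTools", "MasqueradeAsHuman"} <= active

monitor = RuleMonitor(scam_rule)

# Toy per-token CE probabilities; in practice these come from the CE classifier.
stream = [
    {"Threaten": 0.92, "PaymentTools": 0.10, "MasqueradeAsHuman": 0.05},
    {"Threaten": 0.15, "PaymentTools": 0.88, "MasqueradeAsHuman": 0.12},
    {"Threaten": 0.07, "PaymentTools": 0.20, "MasqueradeAsHuman": 0.91},
]
for i, probs in enumerate(stream):
    if monitor.observe(probs):
        print(f"Rule fired at token {i}: stop generation")   # fires at token 2
        break
```

The subset test makes the conjunction explicit, and changing policy means swapping one rule function rather than retraining anything.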

Crucially:

  • Rules are human-readable
  • Rules are auditable
  • Rules can be shared, versioned, and reused

This is governance logic, not just safety heuristics.

Architecture in practice

At runtime, GAVEL adds a lightweight multi-label classifier that maps each token’s activations to CE probabilities. These signals are aggregated, evaluated against rule predicates, and trigger enforcement actions (alert, refuse, stop, or steer).

The reported overhead is negligible: under 1% latency increase in real-time inference.
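
To make the runtime path concrete, the sketch below wires per-token CE probes and the RuleMonitor from the earlier sketches into a greedy decoding loop. The hook placement, the bare "stop" enforcement, and the absence of KV caching are simplifications for readability; this is not the paper's reference implementation.

```python
import torch

def generate_with_gavel(model, tok, layer, probes, monitor, prompt, max_new_tokens=128):
    """Greedy decoding with a per-token CE check at one attention sublayer.

    `probes` maps CE name -> fitted per-token classifier (as in the earlier sketch);
    `monitor` is the RuleMonitor sketched above. Names and structure are illustrative.
    """
    grabbed = []
    handle = model.model.layers[layer].self_attn.register_forward_hook(
        lambda _m, _i, out: grabbed.append(out[0].detach().float().cpu())
    )
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    verdict = "completed"
    try:
        for _ in range(max_new_tokens):
            grabbed.clear()
            with torch.no_grad():                 # full re-forward each step: no KV cache, for brevity
                logits = model(ids).logits
            last = grabbed[0][0, -1].numpy().reshape(1, -1)   # current token's attention output
            ce_probs = {n: float(p.predict_proba(last)[0, 1]) for n, p in probes.items()}
            if monitor.observe(ce_probs):         # rule predicate satisfied within the window
                verdict = "stopped: rule violated"  # enforcement action: stop
                break
            next_id = logits[0, -1].argmax().view(1, 1)
            ids = torch.cat([ids, next_id], dim=-1)
    finally:
        handle.remove()
    return tok.decode(ids[0], skip_special_tokens=True), verdict
```

The only added work per token is one hook read and a handful of small linear probes, which is consistent with the sub-1% overhead the authors report.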

Findings — What actually works (and what breaks)

Precision beats brute force

Across nine misuse scenarios (phishing, SQL injection, election interference, romance scams, etc.), GAVEL consistently achieves:

  • ROC-AUC ≈ 0.99
  • Near-zero false positives, even against carefully constructed benign lookalikes

By contrast, activation classifiers trained on whole misuse categories exhibit classic overgeneralization. They detect themes. GAVEL detects intent structures.

Robust to adversarial obfuscation

Text-based moderation collapses under misdirection and goal hijacking. GAVEL does not.

Because Cognitive Elements track internal reasoning, adversarial padding and paraphrasing fail to suppress detection. In multiple categories, GPT-4-based judges dropped below 20% recall under misdirection attacks; GAVEL stayed above 85%.

Model- and language-agnostic

CE datasets are text-only, so detectors can be re-fit on a new model's activations without touching the policy layer, and the same ruleset transfers across:

  • Mistral-7B
  • LLaMA-3
  • Qwen-8B
  • Gemma-4B

Even more quietly impressive: CE detectors trained in English generalized cleanly to Spanish and Mandarin, suggesting that these primitives capture abstract cognition rather than surface language.

Automation closes the loop

The authors also demonstrate an agentic pipeline that can:

  1. Take a natural-language policy description
  2. Propose new CEs
  3. Generate excitation datasets
  4. Synthesize rules

In other words: governance itself becomes partially automatable.
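
What that loop could look like in code is easy to caricature. The sketch below is heavily hedged: the llm() placeholder, the prompts, and the JSON schema are all invented for illustration and may differ substantially from the authors' agent design.

```python
import json

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion call; swap in your provider's client."""
    raise NotImplementedError

def policy_to_rule(policy_text: str) -> dict:
    """Sketch of the agentic loop: policy -> CEs -> excitation data -> rule."""
    ces = json.loads(llm(
        "List the minimal set of Cognitive Elements (short behavior labels) needed to "
        f"enforce this policy, as a JSON array of strings:\n{policy_text}"
    ))
    excitation = {
        ce: json.loads(llm(
            f"Write 20 short text snippets that reliably elicit the behavior '{ce}', "
            "as a JSON array of strings."
        ))
        for ce in ces
    }
    rule = llm(
        "Combine these Cognitive Elements into a Boolean rule (AND/OR over the labels) "
        f"that triggers only when the policy is violated: {ces}\nPolicy: {policy_text}"
    )
    return {"cognitive_elements": ces, "excitation_datasets": excitation, "rule": rule}
```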

Implications — Why this changes the conversation

GAVEL reframes activation safety from a model-centric problem into a policy-centric one.

For regulators, this matters because:

  • Rules are inspectable
  • Violations are explainable
  • Enforcement logic is explicit

For enterprises, it matters because:

  • Policies can be updated without retraining
  • Domain-specific constraints are finally practical
  • False positives stop eroding trust

And for the research community, GAVEL suggests a future where AI safety evolves like cybersecurity did: through shared vocabularies, rule repositories, and incremental refinement.

The trade-off is philosophical. Boolean rules feel rigid in a probabilistic world. But for high-stakes governance, ambiguity is not a virtue — it is a liability.

Conclusion — From alignment to accountability

GAVEL does not replace alignment, RLHF, or moderation APIs. It complements them with something they lack: explicit, auditable enforcement logic.

By decomposing cognition into modular elements and reintroducing rules as first-class citizens, the paper offers a credible bridge between interpretability research and enforceable AI governance.

Not a silver bullet. But finally, a gavel that can strike with reasons.

Cognaptus: Automate the Present, Incubate the Future.