Sirens in the Weights: Why AI Safety May Be Hiding Inside the Model

Moderation usually sits outside the model.

A user sends a prompt. A model answers. Then a separate guard model steps in, reads the text, and declares the content safe or unsafe. In business terms, this is a familiar architecture: put a checkpoint at the gate, classify traffic, block what violates policy, and hope the checkpoint is both fast and sensible. It is the airport-security model of AI safety, except the passenger may be a 40-token prompt, a 4,000-token reasoning trace, or a response that is still being generated while the guard is politely looking for its shoes.

The paper behind SIREN asks a different question: what if the model already knows, somewhere inside its own activations, that the content is drifting into unsafe territory before the final text has been wrapped into a clean output label?¹

That shift matters. Most current guard systems treat safety detection as another language-generation problem: feed the content into a specialized LLM and ask it to generate a label such as safe or unsafe. SIREN treats safety detection as an internal-representation problem. Instead of waiting for a terminal-layer output, it extracts safety-relevant signals from many layers of a frozen language model, selects a sparse set of “safety neurons,” aggregates them, and trains a lightweight classifier on top.

The result is not merely a smaller guard. The more interesting claim is architectural: safety signals may be distributed across the model’s internal computation, and the final output layer may be a surprisingly narrow place to look for them. Very reassuring, of course, that the alarm was inside the machine all along. We just needed to listen before the machine finished speaking.

The usual guard model listens at the end of the conversation

The mainstream LLM guard setup has two conveniences. First, it is operationally simple. A guard model can be deployed as an external moderation service, often without touching the underlying application model. Second, it fits the instruction-following habits of LLM engineering. Convert safety policy into a taxonomy, fine-tune or prompt a guard model, and ask it to produce a classification.

The weakness is equally straightforward: generative guards usually rely on the model’s final representations and then spend compute decoding a label. This works, but it also throws away much of what happened inside the model while it was processing the input. The paper’s core complaint is not that terminal-layer guards are useless. The complaint is that they are late and information-poor relative to what the internal layers may already contain.

SIREN’s wager is that harmfulness is not only a final judgment. It is also a pattern in intermediate activations.

That distinction is easy to miss. If a guard says “unsafe,” we usually think the important thing is the final label. But the label is only the last mile. The internal path to that label can contain earlier, richer, and more granular signals: malicious intent, unsafe instruction structure, toxic framing, refusal-related features, and other safety-relevant semantics. The paper does not claim to solve the entire problem of AI safety. It narrows the problem to binary harmfulness detection, then asks whether internal features are a better substrate for that task.

SIREN changes the unit of safety from output labels to internal features

SIREN, short for Safeguard with Internal Representation, has two stages.

First, it extracts hidden representations from each transformer layer of a frozen LLM. The paper considers residual stream representations and feedforward activations, pools token-level representations into sequence-level features, and trains layer-wise linear probes to predict harmful versus safe labels. These probes are not the final product. They are diagnostic tools for finding which neurons matter most for safety classification.
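The probing step is easier to picture with a toy. What follows is a minimal sketch, not the authors' code: it assumes synthetic activations in place of real hidden states, and it uses plain logistic regression trained by gradient descent as a stand-in for whatever probe training the paper uses. The names `mean_pool`, `train_probe`, and `fake_sequence` are hypothetical.

```python
import math
import random

def mean_pool(token_vectors):
    """Average token-level vectors into one sequence-level feature vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(tok[d] for tok in token_vectors) / n for d in range(dim)]

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim = len(features[0])
    w, b, n = [0.0] * dim, 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 (harmful) or 0 (safe) for one pooled feature vector."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0

rng = random.Random(0)

def fake_sequence(harmful, n_tokens=5, dim=4):
    """Hypothetical stand-in for one layer's token activations.
    By construction, only neuron 2 carries the harmful/safe signal."""
    seq = []
    for _ in range(n_tokens):
        tok = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        tok[2] += 3.0 if harmful else -3.0
        seq.append(tok)
    return seq

labels = [i % 2 for i in range(40)]
features = [mean_pool(fake_sequence(bool(y))) for y in labels]
w, b = train_probe(features, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(features, labels)) / len(labels)
top_neuron = max(range(len(w)), key=lambda d: abs(w[d]))  # largest |weight|
```

In this toy the probe's largest weight lands on the neuron that was built to carry the signal. That is the diagnostic role the probes play: the weights point to where the safety information sits, and the probe itself is then set aside.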

Second, SIREN selects safety neurons within each layer using the magnitude of the learned probe weights, then aggregates selected neurons across layers. Layers are not treated equally. The aggregation is weighted by each layer probe’s validation F1 score, so layers that appear more informative for harmfulness detection receive more influence. The final classifier is a small MLP trained on this concatenated, weighted safety-neuron representation.

A compact version of the pipeline looks like this:

Input text
  -> frozen LLM backbone
  -> internal layer activations
  -> layer-wise probes identify safety-relevant neurons
  -> selected neurons are weighted by layer usefulness
  -> MLP classifier predicts harmful vs safe
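The selection and weighting steps in that pipeline can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: top-k selection stands in for the paper's magnitude-based selection rule, normalizing the F1 scores so they sum to one is an assumption, and the final MLP is omitted. All function names and numbers are hypothetical.

```python
def select_safety_neurons(probe_weights, k):
    """Return indices of the k neurons with the largest absolute probe weight."""
    ranked = sorted(range(len(probe_weights)),
                    key=lambda i: abs(probe_weights[i]), reverse=True)
    return sorted(ranked[:k])

def layer_weights_from_f1(f1_scores):
    """Turn per-layer validation F1 into weights that sum to 1 (an assumption)."""
    total = sum(f1_scores)
    return [f / total for f in f1_scores]

def aggregate(layer_features, layer_probe_weights, layer_f1, k):
    """Scale each layer's selected neurons by its layer weight, then concatenate."""
    weights = layer_weights_from_f1(layer_f1)
    combined = []
    for feats, probe_w, layer_w in zip(layer_features, layer_probe_weights, weights):
        for idx in select_safety_neurons(probe_w, k):
            combined.append(layer_w * feats[idx])
    return combined

# Hypothetical two-layer example with four neurons per layer.
layer_features = [[0.2, -1.0, 0.5, 3.0], [1.5, 0.1, -2.0, 0.4]]
layer_probe_weights = [[0.1, -0.9, 0.05, 1.2], [0.8, 0.0, -1.1, 0.2]]
layer_f1 = [0.6, 0.9]  # the second layer's probe validated better

safety_vector = aggregate(layer_features, layer_probe_weights, layer_f1, k=2)
# safety_vector is what a small MLP classifier would take as input.
```

Here the second layer contributes with weight 0.6 against the first layer's 0.4, so a layer whose probe validates better has more say in the final representation.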

Two details are important for business readers.

First, the base LLM stays frozen. SIREN is not a proposal to fine-tune another full guard model every time the policy team sneezes. It adds a lightweight classifier on top of internal states.

Second, SIREN needs access to those internal states. That makes it most relevant to open-weight, self-hosted, or infrastructure-level deployments where the operator can instrument the model. It is less directly applicable when a company only calls a closed model through a text-only API. The airport-security desk cannot inspect the engine room if the airline refuses to open the door.

The main benchmark result is strong, but the matched comparison is the real point

The paper evaluates SIREN against state-of-the-art open-source guard models, including LlamaGuard3 and Qwen3Guard. The comparison is designed to be fair in an important way: SIREN is trained on the same general-purpose backbones from which the corresponding guard models are built. That means the experiment is not simply “our model versus some random baseline.” It asks whether internal-representation extraction from the general model can beat a safety-specialized guard derived from the same family.

Across seven harmfulness benchmarks, SIREN outperforms the matched guard models on average F1:

Backbone pair	SIREN average F1	Matched guard average F1	Interpretation
Qwen3-0.6B / Qwen3Guard-0.6B	85.6	81.7	SIREN improves a small backbone without safety fine-tuning the full model.
Llama3.2-1B / LlamaGuard3-1B	85.7	70.7	The largest gap; the guard baseline is weak under this evaluation.
Qwen3-4B / Qwen3Guard-4B	86.7	83.4	The strongest single SIREN result in the paper.
Llama3.1-8B / LlamaGuard3-8B	86.3	77.0	Internal features outperform a larger safety-specialized guard.

The Qwen3-4B comparison is the clean headline: SIREN reaches 86.7 average F1, compared with 83.4 for Qwen3Guard-4B. That is a meaningful improvement, but the more revealing number is the Llama3.2-1B gap: 85.7 versus 70.7. It suggests that for some backbones, training a full generative guard may be a clumsy way to access safety information that is already present inside the general model.
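For readers who want the gaps spelled out, the arithmetic from the table is short. The numbers are the ones reported above; the variable names are only for illustration.

```python
# (SIREN average F1, matched guard average F1) per backbone, from the table above.
results = {
    "Qwen3-0.6B": (85.6, 81.7),
    "Llama3.2-1B": (85.7, 70.7),
    "Qwen3-4B": (86.7, 83.4),
    "Llama3.1-8B": (86.3, 77.0),
}

# Gap in F1 points, rounded to one decimal to match the table's precision.
gaps = {name: round(siren - guard, 1) for name, (siren, guard) in results.items()}
largest = max(gaps, key=gaps.get)
mean_gap = sum(gaps.values()) / len(gaps)
```

The gaps come out to 3.9, 15.0, 3.3, and 9.3 points, an average of roughly 7.9. The two Llama pairs account for most of that average, which is why the matched comparison is more informative than any single headline number.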

The paper also reports a precision-recall analysis. SIREN tends to sit closer to a balanced precision-recall diagonal across benchmarks, while several guard baselines show larger swings. For a product team, this matters because moderation failures rarely arrive as a single scalar metric. High recall with poor precision creates unnecessary blocking and user frustration. High precision with poor recall creates risk exposure. A guard that behaves differently across datasets may be technically impressive and operationally annoying, which is a popular combination in enterprise AI.
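A small worked example shows why a single F1 number can hide this. The confusion counts are hypothetical and do not come from the paper; they are chosen so that two very different guards land on the same F1.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical guards evaluated on 100 harmful examples each.
over_blocking = precision_recall_f1(tp=90, fp=60, fn=10)   # catches most, flags too much
under_blocking = precision_recall_f1(tp=60, fp=10, fn=40)  # rarely wrong, misses a lot
balanced = precision_recall_f1(tp=75, fp=25, fn=25)        # sits on the diagonal
```

The over-blocking guard has precision 0.60 and recall 0.90, and the under-blocking guard has roughly 0.86 and 0.60, yet their F1 scores are 0.72 and about 0.71. The balanced guard scores 0.75 on all three. A team that reads only F1 would struggle to tell the first two apart, even though one frustrates users and the other leaks risk.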

The authors speculate that broad pretraining may encode more general harmfulness concepts than a guard fine-tuned on narrower policy distributions. That interpretation is plausible, but it remains an interpretation. The direct result is simpler: in these tests, the internal-feature classifier is more consistent than the matched generative guard baselines.

The mechanism evidence says the middle layers are doing real safety work

The strongest part of the paper is not the performance table. It is the mechanism evidence that explains why SIREN works.

The authors examine layer-wise linear probes and find that individual probes can come within about four F1 points of fine-tuned guard models, with middle layers performing best and peaking around 79 average F1. The middle-layer result is important. It supports the view that safety-relevant semantics are not merely a final-output artifact. They appear to be represented before the model has collapsed its computation back toward next-token prediction.

This also explains why SIREN aggregates across layers. If one layer were enough, the method could simply attach a classifier to that layer and call it a day. But the paper reports that cross-layer aggregation produces an additional eight-point improvement over layer-wise probes. The obvious reading is that harmfulness detection benefits from multiple levels of representation: lexical clues, intent patterns, abstract semantic structure, and perhaps response-stage signals.

The ablation tests help separate the paper’s main claim from implementation decoration:

Test	Likely purpose	What it supports	What it does not prove
Seven-benchmark comparison	Main evidence	SIREN beats matched open-source guard models on average F1.	It does not prove universal superiority across all safety policies or deployment settings.
Precision-recall analysis	Main evidence / behavior analysis	SIREN appears more balanced across datasets.	It does not guarantee calibrated thresholds for a specific company policy.
Layer-wise probes	Mechanism evidence	Internal layers, especially middle layers, contain safety-relevant signals.	It does not fully explain what each neuron semantically represents.
Neuron selection threshold ablation	Robustness / sensitivity test	Performance remains stable across a useful sparse-neuron range.	It does not remove the need for validation on new model families.
Adaptive vs uniform aggregation	Ablation	Layer weighting adds roughly 1.0–1.3 F1 points over uniform aggregation.	It does not make layer weighting the only possible aggregation strategy.
Think reasoning-trace benchmark	Generalization test	SIREN transfers to unseen reasoning-trace safety detection.	It does not cover every form of chain-of-thought or hidden reasoning deployment.
Streaming detection	Exploratory extension / generalization	Sequence-level SIREN can be reused for token-prefix detection without retraining.	It does not by itself define the production threshold policy.
Cross-model ensemble	Exploratory extension	Combining SIREN instances can raise average F1 to 87.7.	It adds inference cost and is not the core single-model claim.
Applying SIREN to guard models	Exploratory extension	SIREN can enhance existing specialized guards in-place.	It still requires internal access to the guard model.

That table matters because it prevents a common reading error. Not every experiment in the paper is a second thesis. Some tests establish the main result. Some test sensitivity. Some explore possible extensions. Mixing them together makes the paper look like a pile of tricks. Read correctly, the paper has one central thesis: internal safety representations are useful, sparse, and operationally exploitable.

Sparse safety neurons are where the efficiency story begins

The efficiency claim has two parts: training efficiency and inference efficiency.

For training, the paper reports that SIREN introduces only 14 million trainable parameters on Qwen3-4B, compared with the billion-level parameter scale of a full fine-tuned guard model. It also reports that training SIREN on Qwen3-4B completes in six A100 GPU hours. These are not just nice engineering details. They change who can realistically maintain a guard system.

A company that cannot afford repeated full-model safety fine-tuning may still be able to train and update a lightweight classifier over internal activations. That is especially relevant for organizations deploying domain-specific LLMs in finance, education, healthcare operations, customer support, or internal knowledge systems, where moderation policies may be narrower than public-platform safety taxonomies but still need to be enforced consistently.

For inference, SIREN avoids autoregressive label generation. It performs classification using hidden states from a single forward pass plus a small classifier. The paper estimates that generative guards require about four times higher computational cost under conservative assumptions: perfect KV-cache use and only four generated tokens. In practice, the authors note that their other evaluations allow up to 128 new tokens, meaning the four-token assumption is intentionally favorable to the guard models.

The business implication is not “SIREN makes safety free.” It does not. You still need to run the backbone, extract activations, store or stream the right tensors, and maintain a classifier. The better interpretation is that SIREN changes safety overhead from “run another model that generates a label” toward “reuse internal computation and classify a compressed safety signal.” For high-volume systems, that distinction can show up directly in budget, latency, and throughput.

Streaming detection is the most operationally interesting extension

The streaming result deserves more attention than a normal benchmark paragraph.

Modern LLM outputs are often streamed token by token. That creates a safety timing problem. A sequence-level guard can inspect the final answer, but by then unsafe content may already have been emitted. The common workaround is awkward: buffer more output before showing it, interrupt generation, or run specialized streaming guards that require extra design and supervision.

SIREN transfers to streaming in a surprisingly direct way. The authors apply the same feature extractor to progressively longer prefixes of the generated sequence, using mean pooling over the internal activations available up to each token. No parameters are updated for streaming detection. In the Think benchmark setup, where unsafe spans are annotated, SIREN is evaluated at the unsafe boundary and at grace-period positions up to 256 tokens later.
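The prefix trick can be sketched as a running mean feeding a frozen scorer. This is a toy under loud assumptions: a two-dimensional logistic scorer stands in for the trained SIREN classifier, and the token vectors, weights, and threshold are all hypothetical.

```python
import math

def prefix_means(token_vectors):
    """Yield the mean-pooled feature vector for each progressively longer prefix."""
    running = [0.0] * len(token_vectors[0])
    for t, tok in enumerate(token_vectors, start=1):
        running = [r + x for r, x in zip(running, tok)]
        yield [r / t for r in running]

def harm_score(features, weights, bias):
    """Hypothetical frozen classifier: a logistic score between 0 and 1."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def first_alarm(token_vectors, weights, bias, threshold):
    """Return the 1-based token position where the score first crosses the threshold."""
    for position, feats in enumerate(prefix_means(token_vectors), start=1):
        if harm_score(feats, weights, bias) >= threshold:
            return position
    return None

# Toy trace: three benign tokens, then the content turns harmful.
tokens = [[-1.0, 0.0]] * 3 + [[4.0, 0.0]] * 3
weights, bias = [2.0, 0.0], 0.0  # no parameters change between prefixes

scores = [harm_score(f, weights, bias) for f in prefix_means(tokens)]
alarm_at = first_alarm(tokens, weights, bias, threshold=0.8)
```

In this toy the score stays near 0.12 for the benign prefix, climbs once the harmful tokens arrive, and crosses the 0.8 threshold at the fifth token. Mean pooling dilutes new evidence with everything that came before, so the alarm lags the turn by a token or two. That lag is one reason to evaluate at grace-period positions rather than only at the boundary.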

The paper reports that SIREN consistently captures more harmful examples than Qwen3Guard-Stream across detection latency positions. It also highlights a representative example where the harmfulness score remains low during initially benign reasoning and rises sharply when the content turns dangerous.

This is not just a technical convenience. It changes moderation from post-hoc inspection to live signal monitoring. A deployment could, in principle, use softer thresholds during early ambiguous reasoning and tighten thresholds when the output moves toward user-visible content. The appendix explicitly notes that SIREN produces continuous harmfulness scores, not only discrete labels, making threshold control more flexible than a generative guard that emits categorical text.

That flexibility is valuable, but it is also where governance work begins. Continuous scores are not policies. Someone still has to decide what threshold blocks, warns, routes to review, or merely logs. SIREN gives the dashboard more knobs. It does not tell the pilot where to fly.
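To make the governance point concrete, here is a minimal sketch of the layer a product team would still have to write. The action names and every threshold value are hypothetical; nothing here comes from the paper.

```python
def moderation_action(score, block_at=0.9, review_at=0.7, warn_at=0.5):
    """Map a continuous harmfulness score to a policy action.
    Thresholds are illustrative placeholders, not recommended values."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= block_at:
        return "block"
    if score >= review_at:
        return "review"
    if score >= warn_at:
        return "warn"
    return "log"

# The same score can mean different things in different phases of a response.
# Here a looser policy applies to early reasoning and a stricter one to
# user-visible output, purely as an example of the knob being turned.
score = 0.75
during_reasoning = moderation_action(score, block_at=0.95, review_at=0.85, warn_at=0.6)
user_visible = moderation_action(score, block_at=0.7, review_at=0.5, warn_at=0.3)
```

With these placeholder numbers, a score of 0.75 only triggers a warning during reasoning but blocks user-visible text. The classifier supplies the score. Everything in the function signature is a policy decision that someone has to own.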

What this means for enterprise AI deployment

For enterprise teams, the practical significance of SIREN is easiest to see in four deployment patterns.

Enterprise need	How SIREN could help	Boundary to remember
Lower-latency moderation	Classification can avoid generative label decoding.	It still requires internal-state access and engineering integration.
Cheaper guard adaptation	A small classifier can be trained over frozen-model activations.	Policy changes may still require new labeled data and validation.
Streaming response monitoring	Prefix-level harmfulness scores can be computed without streaming-specific retraining.	Thresholds and intervention actions are product-policy decisions.
Diagnostics and auditability	Layer and neuron analysis may reveal where safety signals live.	Neuron importance is not the same as human-readable causal explanation.

This is especially relevant for companies building on open models rather than closed API-only systems. If an enterprise hosts Qwen, Llama, Mistral, or similar transformer backbones in its own infrastructure, internal activation access is technically possible. SIREN-like methods could sit close to the inference stack, reading hidden states as the model processes user prompts and generates responses.

For closed proprietary APIs, the idea is less immediately deployable. Unless the provider exposes internal activations or offers a comparable internal-signal moderation endpoint, customers can only moderate text inputs and outputs. In that world, SIREN is more of a clue for platform providers than a tool for downstream application teams.

The paper also suggests a more subtle governance lesson: safety should not be treated only as a final content classification problem. In model operations, there may be value in monitoring intermediate cognitive states, not because they are magical, but because they can expose risk earlier and with less decoding overhead. That applies beyond harmfulness detection. Similar representation-level monitoring could matter for hallucination risk, sensitive-data leakage, tool misuse, or agentic planning drift. The extension to those other risks is a Cognaptus inference, not a result proven by this paper.

The limits are real, but they are specific

The paper’s limitations are not ornamental. They affect how SIREN should be interpreted.

First, the method depends on the availability and usefulness of internal representations in transformer-based LLMs. The authors rely on linear probing to identify safety-relevant neurons. If a future architecture encodes relevant concepts differently, or if the harmfulness concept is not linearly discoverable within individual layers, the method may require adaptation.

Second, the task is binary harmfulness classification. Many real moderation systems need finer distinctions: self-harm, hate, sexual content, cyber abuse, regulated goods, medical misinformation, financial manipulation, and context-specific policy exceptions. The authors note that the framework can support multi-label classification, but this paper mainly evaluates the binary case. That is a meaningful boundary, not a footnote to wave away.

Third, SIREN may inherit the biases of the underlying LLM. If the base model’s internal representations encode skewed associations about dialect, identity, political speech, or culturally specific expression, a classifier trained on those activations can inherit the problem elegantly. Efficient bias is still bias; it merely arrives faster.

Fourth, SIREN is not model-agnostic in the way an external text-only classifier is model-agnostic. It is plug-and-play only when the operator can access and instrument the model’s internal states. That makes the method powerful for some infrastructure settings and awkward for others.

Finally, benchmark superiority does not equal production readiness. Moderation deployments care about calibration, severity tiers, appeals, logging, red-team drift, jurisdictional rules, and user experience. SIREN improves the signal extraction layer. It does not replace the governance stack.

The better lesson is not “smaller guard,” but “earlier signal”

The tempting headline is that SIREN is a cheaper guard model. That is true enough, but too small.

The deeper point is that safety detection may not need to wait for a model to finish speaking. Internal representations can contain safety-relevant information before the terminal layer produces a final answer and before a separate generative guard spends tokens explaining what everyone already hoped it would classify correctly.

This makes SIREN part of a broader shift in AI operations: from supervising only outputs to monitoring the computation that produces those outputs. For companies deploying open or self-hosted models, that shift could reduce latency, lower guard-training cost, improve streaming moderation, and make safety systems more inspectable. For companies depending entirely on closed text-only APIs, it is a reminder that some of the most useful safety signals may be hidden behind the provider’s abstraction layer.

The paper does not prove that internal monitoring solves AI safety. It proves something narrower and more useful: for harmfulness detection, a sparse set of safety-relevant internal features can outperform matched open-source guard models across several benchmarks, generalize to reasoning traces and streaming detection, and do so with much lower trainable-parameter overhead.

That is enough to make the architecture worth watching. Not because it gives us a perfect guardrail, but because it moves the guardrail closer to where the model is actually thinking.

Cognaptus: Automate the Present, Incubate the Future.

¹ Difan Jiao, Yilun Liu, Ye Yuan, Zhenwei Tang, Linfeng Du, Haolun Wu, and Ashton Anderson, “LLM Safety From Within: Detecting Harmful Content with Internal Representations,” arXiv:2604.18519v1, 20 Apr 2026.
