Opening — Why this matters now
LLM governance is entering its awkward adolescence. Models behave well in demos, misbehave in deployment, and—most annoyingly—lie about whether they are misbehaving. Behaviour probes promised a clean solution: tiny classifiers that sit atop the model’s hidden states and report when deception, sycophancy, or other governance‑relevant traits appear.
But the paper “That’s Not Natural: The Impact of Off‑Policy Training Data on Probe Performance” delivers an uncomfortable reminder: if you train probes on synthetic or off‑policy data, they don’t learn the behaviour. They learn the artefacts.
And governance built on artefacts is governance with blind spots.
Background — Context and prior art
Behavioural probes—especially linear and attention‑based probes—have emerged as low‑cost tools for inference‑time monitoring. They take hidden activations, map them to a score, and attempt to predict whether a specific behaviour is present.
Two forces made probes attractive:
- Low data cost: Hidden states already encode enormous semantic structure.
- Low overhead: They can run in real time, even at scale.
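To make the mechanics above concrete, here is a minimal linear‑probe sketch in Python. It assumes the activations (for example, one last‑token hidden‑state vector per example) have already been extracted from a chosen layer; the dimensions, variable names, and random stand‑in data are illustrative and not the paper's setup.

```python
# Minimal linear-probe sketch (illustrative; not the paper's exact setup).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d_model = 4096                                # hidden size (illustrative)
X_train = rng.normal(size=(512, d_model))     # stand-in for extracted activations
y_train = rng.integers(0, 2, size=512)        # 1 = behaviour present, 0 = absent

probe = LogisticRegression(max_iter=1000)     # the "tiny classifier" on hidden states
probe.fit(X_train, y_train)

X_eval = rng.normal(size=(128, d_model))      # stand-in for held-out natural examples
y_eval = rng.integers(0, 2, size=128)
scores = probe.predict_proba(X_eval)[:, 1]    # per-example behaviour score
print("AUROC on eval set:", roc_auc_score(y_eval, scores))
```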
However, real examples of problematic behaviours (e.g., deception, jailbreaking, malicious refusal) are rare. Researchers therefore turned to synthetic data, prompting large models to generate examples of the target behaviours.
This paper’s team decided to test whether relying on such off‑policy data subtly sabotages generalisation.
Spoiler: it does.
Analysis — What the paper actually does
The authors evaluate eight LLM behaviours across four data regimes:
- Natural human‑written examples
- Synthetic in‑distribution (same model, cooperative prompts)
- Synthetic off‑policy (different models or adversarial instruction)
- Mixed datasets
They train two types of probes (a toy attention‑probe sketch follows the list below):
- Linear probes, cheap but surprisingly strong
- Attention probes, which learn position‑aware representations
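Attention probes admit many instantiations; a common construction is a single learned query that attends over per‑token hidden states, with the pooled vector feeding a linear head. The PyTorch sketch below is an assumption about that general construction, not the paper's exact architecture; all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    """Generic attention-pooling probe: a learned query attends over per-token
    hidden states; the pooled vector feeds a linear head. An assumed, common
    construction, not necessarily the paper's exact architecture."""
    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(d_model) / d_model ** 0.5)
        self.head = nn.Linear(d_model, 1)

    def forward(self, hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); mask: (batch, seq), 1 for real tokens
        attn = hidden @ self.query                      # (batch, seq) raw scores
        attn = attn.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(attn, dim=-1)           # position-aware weighting
        pooled = (weights.unsqueeze(-1) * hidden).sum(dim=1)
        return self.head(pooled).squeeze(-1)            # behaviour logit per example

probe = AttentionProbe(d_model=4096)
logits = probe(torch.randn(8, 64, 4096), torch.ones(8, 64))  # toy activations
```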
Then they measure how probes trained under each regime perform on natural, human‑written evaluation sets.
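To see how that cross‑regime measurement works mechanically, here is an illustrative evaluation harness (not the paper's code): it trains a probe on each regime and scores it on every regime, producing the kind of generalisation matrix the tables below summarise. The regime names, array shapes, and random stand‑in data are placeholders, and a proper train/eval split is omitted for brevity.

```python
# Illustrative cross-regime evaluation harness (placeholder data, not the paper's code).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
regimes = ["natural", "synthetic_in_dist", "synthetic_off_policy", "mixed"]
# Stand-in activation datasets; in practice these come from the model's hidden states.
data = {r: (rng.normal(size=(256, 512)), rng.integers(0, 2, size=256)) for r in regimes}

for train_regime in regimes:
    X_tr, y_tr = data[train_regime]
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    row = {}
    for eval_regime in regimes:
        X_ev, y_ev = data[eval_regime]
        row[eval_regime] = roc_auc_score(y_ev, probe.predict_proba(X_ev)[:, 1])
    print(train_regime, {k: round(v, 3) for k, v in row.items()})
```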
Core finding
Synthetic data—especially off‑policy synthetic data—induces probes to latch onto spurious cues. They perform well on similar synthetic data but collapse on natural examples.
In other words, probes trained on synthetic deception examples learn to detect “the way models write when told to act deceptive,” not deception itself.
The problem is structural: off‑policy samples shift linguistic style, length, token distribution, and reasoning patterns. Probes mistake these differences for the underlying behaviour.
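One practical way to surface this failure mode (an illustrative diagnostic of ours, not a result reported in the paper) is to train a baseline probe on surface style features alone: if length and punctuation statistics already separate synthetic positives from negatives, the behaviour label is confounded with style. The text lists below are placeholders.

```python
# Confound check sketch: can style features alone predict the behaviour label?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def surface_features(texts):
    # Crude style features: character length, mean word length, punctuation rate.
    feats = []
    for t in texts:
        words = t.split()
        feats.append([
            len(t),
            np.mean([len(w) for w in words]) if words else 0.0,
            sum(c in ".,!?;:" for c in t) / max(len(t), 1),
        ])
    return np.asarray(feats)

texts = ["example positive response ..."] * 64 + ["example negative response ..."] * 64  # placeholders
labels = np.array([1] * 64 + [0] * 64)
auc = cross_val_score(LogisticRegression(max_iter=1000),
                      surface_features(texts), labels,
                      cv=5, scoring="roc_auc").mean()
print("Surface-feature AUROC:", auc)  # near 0.5 is reassuring; high is a red flag
```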
Findings — Results with visualization
The paper provides several tables comparing cross‑regime generalisation. Below is a distilled reproduction capturing the structure of their analysis.
Table 1 — Generalisation of probes trained under different regimes
| Training Regime | Avg. Performance on Synthetic Eval | Avg. Performance on Natural Eval | Failure Mode |
|---|---|---|---|
| Natural | High | High | Limited data |
| Synthetic (in‑dist) | Very High | Medium | Style overfitting |
| Synthetic (off‑policy) | Very High | Low | Behaviour misalignment |
| Mixed | High | Medium‑High | Residual artefact learning |
Table 2 — Behaviour‑specific degradation (sketch)
| Behaviour | Natural→Synthetic Gap | Off‑Policy→Natural Gap | Notes |
|---|---|---|---|
| Deception | Small | Large | Models simulate deception inconsistently |
| Sycophancy | Moderate | Moderate | Over‑polite synthetic samples distort signal |
| Refusal Patterns | Small | Large | Style variability dominates semantics |
| Self‑contradiction | Moderate | Very Large | Off‑policy samples exaggerate contradictions |
Across the board, probes trained on data that does not match the distribution of natural behaviour see their performance collapse on natural evaluation sets.
Implications — Why this matters for businesses and regulators
This result lands in an uncomfortable place.
If your safety, compliance, or product‑quality monitoring relies on probes trained on synthetic examples, you may be detecting artefacts instead of actual harms.
For enterprises
- Governance dashboards may be overconfident. Probes report “all clean” while real misbehaviour goes undetected.
- Synthetic data pipelines need auditing. Off‑policy generation should be treated as a risk, not a convenience.
- Monitoring costs may rise. Human‑curated natural examples—or carefully constrained synthetic generation—are necessary.
For regulators
- Governance frameworks must account for distribution shift. An assurance claim like “we probe for deception” is meaningless without disclosure of data provenance.
- Model providers should report probe robustness the same way cybersecurity firms report threat‑model coverage.
For the AI ecosystem
Probe‑based detection is still promising, but it needs:
- better data collection,
- intentional distribution matching, and
- stress‑testing on natural adversarial samples (a minimal sketch of a simple distribution check follows this list).
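As one concrete example of a distribution check, the sketch below compares a simple style statistic (response length) between natural and synthetic training texts with a two‑sample Kolmogorov–Smirnov test. This is an assumed, minimal diagnostic rather than a procedure from the paper, and the text lists are placeholders.

```python
# Minimal distribution-matching check (illustrative sketch, placeholder data).
import numpy as np
from scipy.stats import ks_2samp

natural_texts = ["a naturally written reply ..."] * 200                      # placeholders
synthetic_texts = ["a model-generated reply that is noticeably longer ..."] * 200

natural_lengths = np.array([len(t.split()) for t in natural_texts])
synthetic_lengths = np.array([len(t.split()) for t in synthetic_texts])

stat, p_value = ks_2samp(natural_lengths, synthetic_lengths)
print(f"KS statistic={stat:.3f}, p={p_value:.3g}")
# A large, significant gap means the synthetic set differs in style alone,
# which is exactly the kind of artefact a probe can latch onto.
```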
In short: probes aren’t doomed. But naive synthetic data is.
Conclusion — Wrap‑up
This paper highlights a subtle but critical governance failure mode: probes generalise to the training regime, not the behaviour. Synthetic or off‑policy data introduces artefacts that mimic, exaggerate, or replace the true target behaviour.
If we want reliable monitoring, we must invest in natural examples—or develop synthetic pipelines that faithfully reproduce natural distributions.
Cognaptus: Automate the Present, Incubate the Future.