Opening — Why this matters now
LLM governance is entering its awkward adolescence. Models behave well in demos, misbehave in deployment, and—most annoyingly—lie about whether they are misbehaving. Behaviour probes promised a clean solution: tiny classifiers that sit atop the model’s hidden states and report when deception, sycophancy, or other governance‑relevant traits appear.
But the paper “That’s Not Natural: The Impact of Off‑Policy Training Data on Probe Performance” delivers an uncomfortable reminder: if you train probes on synthetic or off‑policy data, they don’t learn the behaviour. They learn the artefacts.
And governance built on artefacts is governance with blind spots.
Background — Context and prior art
Behavioural probes—especially linear and attention‑based probes—have emerged as low‑cost tools for inference‑time monitoring. They take hidden activations, map them to a score, and attempt to predict whether a specific behaviour is present.
Two forces made probes attractive:
- Low data cost: Hidden states already encode enormous semantic structure.
- Low overhead: They can run in real time, even at scale.
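To make the mechanics above concrete, here is a minimal linear‑probe sketch in Python. It assumes the activations (for example, one last‑token hidden‑state vector per example) have already been extracted from a chosen layer; the dimensions, variable names, and random stand‑in data are illustrative and not the paper's setup.

```python
# Minimal linear-probe sketch (illustrative; not the paper's exact setup).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d_model = 4096                                # hidden size (illustrative)
X_train = rng.normal(size=(512, d_model))     # stand-in for extracted activations
y_train = rng.integers(0, 2, size=512)        # 1 = behaviour present, 0 = absent

probe = LogisticRegression(max_iter=1000)     # the "tiny classifier" on hidden states
probe.fit(X_train, y_train)

X_eval = rng.normal(size=(128, d_model))      # stand-in for held-out natural examples
y_eval = rng.integers(0, 2, size=128)
scores = probe.predict_proba(X_eval)[:, 1]    # per-example behaviour score
print("AUROC on eval set:", roc_auc_score(y_eval, scores))
```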
However, real examples of problematic behaviours (e.g., deception, jailbreaking, malicious refusal) are rare. Researchers therefore turned to synthetic data, prompting large models to generate examples of the target behaviours.
This paper’s team decided to test whether relying on such off‑policy data subtly sabotages generalisation.
Spoiler: it does.
Analysis — What the paper actually does
The authors evaluate eight LLM behaviours across four data regimes:
- Natural human‑written examples
- Synthetic in‑distribution (same model, cooperative prompts)
- Synthetic off‑policy (different models or adversarial instruction)
- Mixed datasets
They train two types of probes (a toy attention‑probe sketch follows the list below):
- Linear probes, cheap but surprisingly strong
- Attention probes, which learn position‑aware representations
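Attention probes admit many instantiations; a common construction is a single learned query that attends over per‑token hidden states, with the pooled vector feeding a linear head. The PyTorch sketch below is an assumption about that general construction, not the paper's exact architecture; all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    """Generic attention-pooling probe: a learned query attends over per-token
    hidden states; the pooled vector feeds a linear head. An assumed, common
    construction, not necessarily the paper's exact architecture."""
    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(d_model) / d_model ** 0.5)
        self.head = nn.Linear(d_model, 1)

    def forward(self, hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); mask: (batch, seq), 1 for real tokens
        attn = hidden @ self.query                      # (batch, seq) raw scores
        attn = attn.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(attn, dim=-1)           # position-aware weighting
        pooled = (weights.unsqueeze(-1) * hidden).sum(dim=1)
        return self.head(pooled).squeeze(-1)            # behaviour logit per example

probe = AttentionProbe(d_model=4096)
logits = probe(torch.randn(8, 64, 4096), torch.ones(8, 64))  # toy activations
```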
Then they measure how probes trained under each regime perform on natural, human‑written evaluation sets.
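To see how that cross‑regime measurement works mechanically, here is an illustrative evaluation harness (not the paper's code): it trains a probe on each regime and scores it on every regime, producing the kind of generalisation matrix the tables below summarise. The regime names, array shapes, and random stand‑in data are placeholders, and a proper train/eval split is omitted for brevity.

```python
# Illustrative cross-regime evaluation harness (placeholder data, not the paper's code).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
regimes = ["natural", "synthetic_in_dist", "synthetic_off_policy", "mixed"]
# Stand-in activation datasets; in practice these come from the model's hidden states.
data = {r: (rng.normal(size=(256, 512)), rng.integers(0, 2, size=256)) for r in regimes}

for train_regime in regimes:
    X_tr, y_tr = data[train_regime]
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    row = {}
    for eval_regime in regimes:
        X_ev, y_ev = data[eval_regime]
        row[eval_regime] = roc_auc_score(y_ev, probe.predict_proba(X_ev)[:, 1])
    print(train_regime, {k: round(v, 3) for k, v in row.items()})
```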
Core finding
Synthetic data—especially off‑policy synthetic data—induces probes to latch onto spurious cues. They perform well on similar synthetic data but collapse on natural examples.
In other words, probes trained on synthetic deception examples learn to detect “the way models write when told to act deceptive,” not deception itself.
The problem is structural: off‑policy samples shift linguistic style, length, token distribution, and reasoning patterns. Probes mistake these differences for the underlying behaviour.
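One practical way to surface this failure mode (an illustrative diagnostic of ours, not a result reported in the paper) is to train a baseline probe on surface style features alone: if length and punctuation statistics already separate synthetic positives from negatives, the behaviour label is confounded with style. The text lists below are placeholders.

```python
# Confound check sketch: can style features alone predict the behaviour label?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def surface_features(texts):
    # Crude style features: character length, mean word length, punctuation rate.
    feats = []
    for t in texts:
        words = t.split()
        feats.append([
            len(t),
            np.mean([len(w) for w in words]) if words else 0.0,
            sum(c in ".,!?;:" for c in t) / max(len(t), 1),
        ])
    return np.asarray(feats)

texts = ["example positive response ..."] * 64 + ["example negative response ..."] * 64  # placeholders
labels = np.array([1] * 64 + [0] * 64)
auc = cross_val_score(LogisticRegression(max_iter=1000),
                      surface_features(texts), labels,
                      cv=5, scoring="roc_auc").mean()
print("Surface-feature AUROC:", auc)  # near 0.5 is reassuring; high is a red flag
```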
Findings — Results with visualization
The paper provides several tables comparing cross‑regime generalisation. Below is a distilled reproduction capturing the structure of their analysis.
Table 1 — Generalisation of probes trained under different regimes
| Training Regime | Avg. Performance on Synthetic Eval | Avg. Performance on Natural Eval | Failure Mode |
|---|---|---|---|
| Natural | High | High | Limited data |
| Synthetic (in‑dist) | Very High | Medium | Style overfitting |
| Synthetic (off‑policy) | Very High | Low | Behaviour misalignment |
| Mixed | High | Medium‑High | Residual artefact learning |
Table 2 — Behaviour‑specific degradation (sketch)
| Behaviour | Natural→Synthetic Gap | Off‑Policy→Natural Gap | Notes |
|---|---|---|---|
| Deception | Small | Large | Models simulate deception inconsistently |
| Sycophancy | Moderate | Moderate | Over‑polite synthetic samples distort signal |
| Refusal Patterns | Small | Large | Style variability dominates semantics |
| Self‑contradiction | Moderate | Very Large | Off‑policy samples exaggerate contradictions |
Across the board, probes trained on data that does not match the distribution of natural behaviour see their performance collapse on natural evaluation sets.
Implications — Why this matters for businesses and regulators
This result lands in an uncomfortable place.
If your safety, compliance, or product‑quality monitoring relies on probes trained on synthetic examples, you may be detecting artefacts instead of actual harms.
For enterprises
- Governance dashboards may be overconfident. Probes report “all clean” while real misbehaviour goes undetected.
- Synthetic data pipelines need auditing. Off‑policy generation should be treated as a risk, not a convenience.
- Monitoring costs may rise. Human‑curated natural examples—or carefully constrained synthetic generation—are necessary.
For regulators
- Governance frameworks must account for distribution shift. An assurance claim like “we probe for deception” is meaningless without disclosure of data provenance.
- Model providers should report probe robustness the same way cybersecurity firms report threat‑model coverage.
For the AI ecosystem
Probe‑based detection is still promising, but it needs:
- better data collection,
- intentional distribution matching, and
- stress‑testing on natural adversarial samples (a minimal sketch of a simple distribution check follows this list).
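As one concrete example of a distribution check, the sketch below compares a simple style statistic (response length) between natural and synthetic training texts with a two‑sample Kolmogorov–Smirnov test. This is an assumed, minimal diagnostic rather than a procedure from the paper, and the text lists are placeholders.

```python
# Minimal distribution-matching check (illustrative sketch, placeholder data).
import numpy as np
from scipy.stats import ks_2samp

natural_texts = ["a naturally written reply ..."] * 200                      # placeholders
synthetic_texts = ["a model-generated reply that is noticeably longer ..."] * 200

natural_lengths = np.array([len(t.split()) for t in natural_texts])
synthetic_lengths = np.array([len(t.split()) for t in synthetic_texts])

stat, p_value = ks_2samp(natural_lengths, synthetic_lengths)
print(f"KS statistic={stat:.3f}, p={p_value:.3g}")
# A large, significant gap means the synthetic set differs in style alone,
# which is exactly the kind of artefact a probe can latch onto.
```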
In short: probes aren’t doomed. But naive synthetic data is.
Conclusion — Wrap‑up
This paper highlights a subtle but critical governance failure mode: probes generalise to the training regime, not the behaviour. Synthetic or off‑policy data introduces artefacts that mimic, exaggerate, or replace the true target behaviour.
If we want reliable monitoring, we must invest in natural examples—or develop synthetic pipelines that faithfully reproduce natural distributions.
Cognaptus: Automate the Present, Incubate the Future.