Opening — Why this matters now

LLM governance is entering its awkward adolescence. Models behave well in demos, misbehave in deployment, and—most annoyingly—lie about whether they are misbehaving. Behaviour probes promised a clean solution: tiny classifiers that sit atop the model’s hidden states and report when deception, sycophancy, or other governance‑relevant traits appear.

But the paper “That’s Not Natural: The Impact of Off‑Policy Training Data on Probe Performance” delivers an uncomfortable reminder: if you train probes on synthetic or off‑policy data, they don’t learn the behaviour. They learn the artefacts.

And governance built on artefacts is governance with blind spots.

Background — Context and prior art

Behavioural probes—especially linear and attention‑based probes—have emerged as low‑cost tools for inference‑time monitoring. They take hidden activations, map them to a score, and attempt to predict whether a specific behaviour is present.

Two forces made probes attractive:

  1. Low data cost: Hidden states already encode enormous semantic structure.
  2. Low overhead: They can run in real time, even at scale.

However, real examples of problematic behaviours (e.g., deception, jailbreaking, malicious refusal) are rare. Researchers naturally turned to synthetic data by prompting large models to generate examples of the target behaviours.

This paper’s team decided to test whether relying on such off‑policy data subtly sabotages generalisation.

Spoiler: it does.

Analysis — What the paper actually does

The authors evaluate eight LLM behaviours across four data regimes:

  • Natural human‑written examples
  • Synthetic in‑distribution (same model, cooperative prompts)
  • Synthetic off‑policy (different models or adversarial instruction)
  • Mixed datasets

They train two types of probes:

  • Linear probes, cheap but surprisingly strong
  • Attention probes, which learn position‑aware representations

Then they measure how probes trained under each regime perform on natural, human‑written evaluation sets.

Core finding

Synthetic data—especially off‑policy synthetic data—induces probes to latch onto spurious cues. They perform well on similar synthetic data but collapse on natural examples.

In other words, probes trained on synthetic deception examples learn to detect “the way models write when told to act deceptive,” not deception itself.

The problem is structural: off‑policy samples shift linguistic style, length, token distribution, and reasoning patterns. Probes mistake these differences for the underlying behaviour.

Findings — Results with visualization

The paper provides several tables comparing cross‑regime generalisation. Below is a distilled reproduction capturing the structure of their analysis.

Table 1 — Generalisation of probes trained under different regimes

Training Regime Avg. Performance on Synthetic Eval Avg. Performance on Natural Eval Failure Mode
Natural High High Limited data
Synthetic (in‑dist) Very High Medium Style overfitting
Synthetic (off‑policy) Very High Low Behaviour misalignment
Mixed High Medium‑High Residual artefact learning

Table 2 — Behaviour‑specific degradation (sketch)

Behaviour Natural→Synthetic Gap Off‑Policy→Natural Gap Notes
Deception Small Large Models simulate deception inconsistently
Sycophancy Moderate Moderate Over‑polite synthetic samples distort signal
Refusal Patterns Small Large Style variability dominates semantics
Self‑contradiction Moderate Very Large Off‑policy samples exaggerate contradictions

Across the board, probe performance collapses when trained on data not matching the distribution of natural behaviour.

Implications — Why this matters for businesses and regulators

This result lands in an uncomfortable place.

If your safety, compliance, or product‑quality monitoring relies on probes trained on synthetic examples, you may be detecting artefacts instead of actual harms.

For enterprises

  • Governance dashboards may be overconfident. Probes report “all clean” while real misbehaviour goes undetected.
  • Synthetic data pipelines need auditing. Off‑policy generation should be treated as a risk, not a convenience.
  • Monitoring costs may rise. Human‑curated natural examples—or carefully constrained synthetic generation—are necessary.

For regulators

  • Governance frameworks must account for distribution shift. An assurance claim like “we probe for deception” is meaningless without disclosure of data provenance.
  • Model providers should report probe robustness the same way cybersecurity firms report threat‑model coverage.

For the AI ecosystem

  • Probe‑based detection is still promising, but it needs:

    • better data collection
    • intentional distribution matching
    • and stress‑testing on natural adversarial samples

In short: probes aren’t doomed. But naive synthetic data is.

Conclusion — Wrap‑up

This paper highlights a subtle but critical governance failure mode: probes generalise to the training regime, not the behaviour. Synthetic or off‑policy data introduces artefacts that mimic, exaggerate, or replace the true target behaviour.

If we want reliable monitoring, we must invest in natural examples—or develop synthetic pipelines that faithfully reproduce natural distributions.

Cognaptus: Automate the Present, Incubate the Future.