Probe and Error: Why Off‑Policy Training Warps LLM Behaviour Detectors
A monitor is only useful if it works in the boring place. The boring place is production: the real domain, the real prompt style, the real user incentives, the real model generating the real response. Not the tidy benchmark. Not the synthetic dataset. Not the “please pretend to be deceptive” prompt that makes everyone in the lab feel productive. Production is where a detector either catches the thing it was built to catch, or quietly becomes a compliance ornament with a nice AUROC score. ...