Cheap Seats, Sharp Eyes: Reward-Hack Detection Without the Frontier Judge

TL;DR for operators: A frontier LLM judge is an expensive way to inspect every agent trajectory for reward hacking. This paper asks whether a much smaller detector can do most of that monitoring job at much lower cost. The answer is yes, under the same information condition and with important caveats. A 13.8M-parameter transformer encoder plus a logistic regression probe detects reward hacking in cleaned Terminal-Wrench trajectories with 0.9467 AUC and 0.8296 TPR@5%FPR. In the authors’ matched comparison, a reproduced gpt-5.4 judge reaches 0.9510 AUC and 0.7130 TPR@5%FPR on the cleaned sanitized-vs-baseline split.¹ ...
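For readers unfamiliar with the TPR@5%FPR figure quoted above, here is a minimal sketch of how such a number can be computed from detector scores. It assumes a simple convention: pick the lowest threshold that flags at most 5% of benign trajectories, then report the share of hacked trajectories scoring above it. The function name `tpr_at_fpr` and the toy scores are hypothetical, and the paper's own thresholding or interpolation convention may differ.

```python
import math


def tpr_at_fpr(scores, labels, max_fpr=0.05):
    """Return (tpr, fpr, threshold) at the loosest threshold with FPR <= max_fpr.

    scores: detector outputs, higher means "more likely reward hacking".
    labels: 1 for a hacked trajectory, 0 for a benign one.
    Assumed convention: a trajectory is flagged when score > threshold.
    """
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    if not neg or not pos:
        raise ValueError("need at least one positive and one negative example")

    # Number of benign trajectories we may flag without exceeding max_fpr.
    k = math.floor(max_fpr * len(neg) + 1e-9)

    # Using the (k+1)-th highest benign score as the threshold means at most
    # k benign scores lie strictly above it (fewer if there are ties).
    threshold = neg[k] if k < len(neg) else float("-inf")

    tpr = sum(s > threshold for s in pos) / len(pos)
    fpr = sum(s > threshold for s in neg) / len(neg)
    return tpr, fpr, threshold


# Toy data (made up for illustration): 20 benign and 4 hacked trajectories.
benign_scores = [i / 100 for i in range(20)]
hacked_scores = [0.05, 0.185, 0.5, 0.9]
toy_scores = benign_scores + hacked_scores
toy_labels = [0] * len(benign_scores) + [1] * len(hacked_scores)

toy_tpr, toy_fpr, toy_threshold = tpr_at_fpr(toy_scores, toy_labels)
print(f"TPR={toy_tpr:.2f} at FPR={toy_fpr:.2f} (threshold {toy_threshold:.2f})")
```

The metric matters for operators because it fixes the false-alarm budget first: two detectors with nearly equal AUC can differ noticeably in how many hacks they catch at the same 5% alert rate.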

June 15, 2026 · 6 min · Zelina