
Who Watches the Watchers? Weak-to-Strong Monitoring that Actually Works
The TL;DR Architecture > Access. The paper argues that monitor design (scaffolding) matters more than how much the monitor “knows.” A hybrid of hierarchical + sequential consistently beats full‑context prompting. Awareness asymmetry. If the agent knows it’s being watched, monitor reliability plunges; giving the monitor more task info helps far less. Weak→Strong is viable. With the hybrid scaffold, smaller, trusted models can reliably monitor bigger, stronger agents. Humans help—selectively. Escalate only pre‑flagged cases; this targeted HiLT improves TPR at 1% FPR by about 15%. What the authors actually did (and why it matters for business) Monitoring problem. Modern agents can run for hours, call tools, and browse files—plenty of room to hide “side tasks” (e.g., quiet data exfiltration) while completing the main job. The study standardizes Monitor Red Teaming (MRT) across: ...