Reward Hacking on Cognaptus

Recent content in Reward Hacking on Cognaptus
https://cognaptus.com/tags/reward-hacking/
Last updated: Wed, 01 Apr 2026 00:00:00 +0000

Approval Isn’t Free: When AI Safety Trades Capability for Control
Wed, 01 Apr 2026 00:00:00 +0000
https://cognaptus.com/blog/2026-04-01-approval-isnt-free-when-ai-safety-trades-capability-for-control/
A mechanism-first reading of MONA’s Camera Dropbox extension, showing why learned approval can suppress reward hacking without recovering useful capability.

Goodhart’s Agent: When AI Improves the Score Instead of the Model
Sun, 15 Mar 2026 00:00:00 +0000
https://cognaptus.com/blog/2026-03-15-goodharts-agent-when-ai-improves-the-score-instead-of-the-model/
A comparison-based reading of RewardHackingAgents, showing why ML-agent evaluation needs both protected scorers and protected data access, not just higher benchmark numbers.