Preference Laundering: How RLHF Can Turn Better Answers Into Bigger Biases

Feedback sounds clean. An annotator compares two model answers. One is more helpful, safer, more complete, and less obviously stupid. The other is worse. The annotator picks the better one. The reward model learns from that preference. The policy is optimized. Everyone goes home believing that the system has become more aligned. ...

June 5, 2026 · 18 min · Zelina