Preference Data

The Label Budget Was Fine. The Pairing Strategy Was Not.

TL;DR for operators: Preference labels are expensive. Model completions are comparatively cheap. The usual workflow responds to this imbalance in the least imaginative way possible: generate a small number of completions, compare whatever pairs happen to be available, and hope the post-training objective sorts out the mess. Hope is not a procurement strategy, though it does have the virtue of requiring no dashboard. ...

Preference Laundering: How RLHF Can Turn Better Answers Into Bigger Biases

Feedback sounds clean. A user tries two model answers. One is more helpful, safer, more complete, and less obviously stupid. The other is worse. The annotator picks the better one. The reward model learns from that preference. The policy is optimized. Everyone goes home believing that the system has become more aligned. ...