The Label Budget Was Fine. The Pairing Strategy Was Not.
TL;DR for operators
Preference labels are expensive. Model completions are comparatively cheap. The usual workflow responds to this imbalance in the least imaginative way possible: generate a small number of completions, compare whatever pairs happen to be available, and hope the post-training objective sorts out the mess. Hope is not a procurement strategy, though it does have the virtue of requiring no dashboard. ...