Time to Prefer: Why Binary RLHF Feedback Leaves Reward Models Guessing
Time to Prefer: Why Binary RLHF Feedback Leaves Reward Models Guessing Thumbs-up feedback looks efficient. It is clean, cheap, easy to store, and friendly to dashboards. One output wins, another output loses, and the reward model learns what humans supposedly want. A tidy little morality market, with all the nuance of a vending machine. ...