Time to Prefer: Why Binary RLHF Feedback Leaves Reward Models Guessing
Thumbs-up feedback looks efficient. It is clean, cheap, easy to store, and friendly to dashboards. One output wins, another output loses, and the reward model learns what humans supposedly want. A tidy little morality market, with all the nuance of a vending machine.
The problem is not that pairwise feedback is useless. It is that pairwise feedback can be too thin for the job people now want it to do. Modern AI systems are no longer being aligned only to a single average user in a narrow setting. They are increasingly expected to adapt to different people, teams, risk appetites, cultural expectations, expertise levels, and changing contexts. In that world, the question is not merely whether a model can learn from human preference data. The harder question is whether it can infer a new person’s reward structure from a few examples without retraining.
The arXiv paper “In-Context Reward Adaptation for Robust Preference Modeling” by Zhenyu Sun, Zheng Xu, and Ermin Wei attacks exactly this problem [1]. Its main result is pleasantly inconvenient: binary pairwise preferences alone can be structurally insufficient for in-context reward adaptation to unseen human preferences, even under idealized conditions with unlimited demonstrations and perfect optimization. The paper then shows why adding response time can recover information about preference strength and improve robustness under out-of-distribution preference shifts.
That distinction matters. The paper is not simply saying, “collect more user feedback.” It is saying that some feedback formats throw away the part of the signal needed for personalization. More labels may give you a larger pile of missing information. Very enterprise. Very scalable. Still missing.
The business fantasy: one reward model, many humans
A standard RLHF pipeline usually trains a reward model from human preferences and then uses that reward model to guide policy optimization. In the simplest business story, this reward model becomes a reusable proxy for “what humans prefer.” That story is operationally convenient, but only because it quietly compresses a messy population into a single objective.
The paper starts from a more realistic premise: human preferences are heterogeneous. A medical compliance reviewer, a consumer support manager, a retail user, and a crypto trader may all evaluate the same model output differently. Even the same person may shift preferences after gaining experience or facing a different risk context. Static reward models and multi-reward systems can partly address this by training separate reward models for known groups or objectives. But that assumes the relevant groups are known in advance.
The paper’s proposed alternative is in-context reward adaptation. Instead of retraining a reward model whenever preferences change, a transformer conditions on a small set of preference demonstrations from a new human type and infers that person’s reward function at inference time.
Mechanically, the paper models each human type as having a latent reward parameter $\theta_i^*$. Given two candidate responses, the human’s binary preference follows a Bradley–Terry model: the probability of choosing one response over another depends on the reward difference between them. The reward itself is assumed to be linear in a shared feature representation:
$$ r_i(x, y) = \phi(x, y)^\top \theta_i^*. $$
That linear assumption is important. It makes the theory tractable and lets the authors isolate the information problem. This is not a claim that real human preferences are always linear. It is a controlled microscope slide. If binary feedback fails even there, the practical lesson is not comforting.
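To make the setup concrete, here is a minimal sketch of a Bradley–Terry choice under a linear reward. The feature names (`clarity`, `caution`), the two example users, and every number are invented for illustration; only the functional form comes from the text above.

```python
import math
from typing import Sequence


def sigmoid(u: float) -> float:
    """Logistic function, written to stay stable for large |u|."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def reward(phi: Sequence[float], theta: Sequence[float]) -> float:
    """Linear reward r = phi^T theta."""
    return sum(p * t for p, t in zip(phi, theta))


def prob_prefers_a(phi_a, phi_b, theta) -> float:
    """Bradley-Terry probability that response A is chosen over response B."""
    return sigmoid(reward(phi_a, theta) - reward(phi_b, theta))


# Hypothetical features per response: [clarity, caution]
phi_a = [0.9, 0.2]
phi_b = [0.4, 0.8]

# Hypothetical latent reward parameters for two human types
theta_trader = [1.0, 0.1]    # mostly values clarity
theta_reviewer = [0.1, 1.0]  # mostly values caution

p_trader = prob_prefers_a(phi_a, phi_b, theta_trader)
p_reviewer = prob_prefers_a(phi_a, phi_b, theta_reviewer)
print(f"trader picks A with p={p_trader:.3f}, reviewer with p={p_reviewer:.3f}")
```

Same pair of outputs, same shared features, opposite likely winners. The only thing that changed is the latent parameter the learner is supposed to infer.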
Binary choices tell direction, not distance
The key mechanism is simple enough to be dangerous: a binary preference tells you which option won, but not how strongly it won.
Suppose two users choose response A over response B. One user may be mildly indifferent: A is a little clearer. Another may see B as dangerously wrong. In both cases, the stored preference label is identical. The direction is preserved; the intensity is mostly gone.
In the paper’s formal setting, each pairwise comparison creates a binary variable $z_i \in \{-1, +1\}$. The feature difference between two candidate responses is denoted by $\tilde{\phi}$. The true choice probability is governed by $\sigma(\tilde{\phi}^\top \theta_i^*)$, where $\sigma$ is the sigmoid function.
The in-context learner sees a sequence of feature differences and binary labels, then predicts the preference for a query pair from the same unseen human type. If binary labels carried enough information, enough demonstrations should let the learner reconstruct the new human’s reward direction well enough for accurate prediction.
The paper shows that this expectation can fail for structural reasons.
The useful summary statistic recovered from many binary demonstrations is not $\theta_i^*$ itself. It is a preference moment: an expectation involving $z_i$ and $\tilde{\phi}$. Because the label is generated through the sigmoid choice model, this moment is generally a nonlinear and compressive transformation of $\theta_i^*$. The transformer then applies a learned linear decoding map to this transformed signal.
The problem is now geometric, not merely statistical. The reward parameter lives in one space. Binary preference moments place it on a distorted nonlinear manifold. A single linear map cannot generally invert that distortion for every possible unseen reward parameter.
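A one-dimensional toy makes the compression visible. Under the Bradley–Terry model with labels in $\{-1, +1\}$, the conditional mean of the label is $2\sigma(u) - 1 = \tanh(u/2)$, so the moment is $\mathbb{E}[\tanh(\tilde{\phi}\theta/2)\,\tilde{\phi}]$. The sketch below assumes a scalar feature difference drawn from a standard normal, which is an illustrative choice and not the paper's construction, and integrates the moment on a fine grid.

```python
import math


def normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def preference_moment(theta: float, lo: float = -8.0, hi: float = 8.0,
                      n: int = 16001) -> float:
    """E[z * x] for x ~ N(0, 1) when P(z = +1 | x) = sigmoid(x * theta).

    Uses E[z | x] = tanh(x * theta / 2) and a Riemann sum on a fine grid.
    The grid bounds and resolution are arbitrary illustrative settings.
    """
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        x = lo + k * dx
        total += math.tanh(0.5 * x * theta) * x * normal_pdf(x)
    return total * dx


thetas = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
moments = [preference_moment(t) for t in thetas]
ceiling = math.sqrt(2.0 / math.pi)  # E|x| for a standard normal

for t, m in zip(thetas, moments):
    print(f"theta={t:5.1f}  moment={m:.3f}  moment/theta={m / t:.3f}")
print(f"ceiling the moment can never exceed: {ceiling:.3f}")
```

The reward parameter grows by a factor of 32 across the list, while the moment stays trapped below a fixed ceiling. A single linear rescaling that decodes small parameters correctly will under-read large ones, which is the one-dimensional shadow of the geometric problem described above.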
Here is the paper’s mechanism in business language:
| Layer | What happens technically | What it means operationally |
|---|---|---|
| Human preference | A latent reward parameter $\theta_i^*$ governs choices | Different users may value the same output features differently |
| Binary comparison | The model observes only which response wins | The feedback records direction but loses preference strength |
| Preference moment | Infinite binary data recovers a nonlinear summary of $\theta_i^*$ | More examples stabilize the wrong-shaped signal |
| In-context decoding | The transformer uses a fixed learned map to infer the query preference | The adapter may work in familiar regions but fail for unseen preference types |
The paper’s Theorem 1 first removes an easy excuse. The training objective is well behaved under its assumptions: strong convexity, a unique minimizer, and convergence of empirical optimization to the population optimum. So the later failure is not about messy optimization or insufficient sample size.
Theorem 2 then gives the unpleasant result: there exist feature and human-type distributions where in-context reward adaptation fails for some unseen human preferences even with infinitely many demonstrations and perfect optimization. This is the theorem that should make product teams slightly less proud of their feedback button.
The impossibility result is the main evidence, not a decorative theorem
Some papers put theory in the middle as academic furniture. Here, the impossibility result is the backbone.
The empirical tables are useful, but they are not the deepest part of the argument. The paper’s strongest contribution is the negative theorem because it changes the diagnosis of failure. Without the theorem, weak out-of-distribution performance could be blamed on too little data, a small transformer, bad training, or unlucky hyperparameters. With the theorem, the diagnosis shifts: binary preference data may not contain enough recoverable information for the desired adaptation task.
That difference matters for business decisions. If the failure is capacity, buy a bigger model. If the failure is data volume, collect more labels. If the failure is signal sufficiency, redesign the feedback interface.
The paper’s logic can be read as a three-step stress test:
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Theorem 1 | Implementation sanity check | The binary-only training problem can be well behaved; failure is not simply optimization collapse | It does not show binary feedback is sufficient |
| Theorem 2 | Main negative evidence | Binary pairwise labels can be structurally insufficient for unseen preference adaptation | It does not say binary labels are useless in all settings |
| Response-time theory | Mechanism repair | Preference strength can restore identifiable information under the model assumptions | It does not prove response time is always practical or clean in real products |
| Synthetic experiment | Controlled validation | The predicted OOD degradation and response-time improvement appear when ground-truth structure is known | It does not establish enterprise deployment performance |
| Food-risk experiment | Behavioral-data extension | The response-time advantage appears on real human choice data, especially with longer in-context sequences | It is not direct evidence from LLM annotation workflows |
This is why a mechanism-first reading is more useful than a scoreboard summary. The paper is not mainly “response time improves accuracy.” It is “binary choices collapse the wrong information; response time partly uncollapses it.” That is a much more expensive sentence, but also a much more useful one.
Response time works because hesitation has information
The paper’s proposed repair is response time. This may sound oddly behavioral for an AI alignment paper, but the move is technically coherent.
The authors use a drift–diffusion model of human decision-making. In such models, choices emerge from an evidence accumulation process. Easy decisions tend to be faster; hard decisions tend to be slower. If a user strongly prefers one output, the decision process reaches a boundary quickly. If the preference is weak, the process wanders longer before crossing a boundary.
Under the paper’s assumptions, combining the preference label with response time yields a signal proportional to the underlying reward difference. That is the critical move. Binary choice alone tells the sign of the preference. Response time helps recover magnitude.
In conceptual form:
$$ \text{choice direction} + \text{decision time} \Rightarrow \text{preference strength signal}. $$
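For readers who want the arithmetic behind that arrow, here is the textbook identity for a symmetric drift–diffusion process. The parameterization is an assumption made for illustration and may differ from the paper's: drift $v = \tilde{\phi}^\top \theta_i^*$, absorbing boundaries at $\pm a$, unit noise, an unbiased starting point, and no non-decision time.

$$ \mathbb{E}[z] = \tanh(a v), \qquad \mathbb{E}[t] = \frac{a}{v}\tanh(a v), \qquad \frac{\mathbb{E}[z]}{\mathbb{E}[t]} = \frac{v}{a}. $$

The choice alone saturates through $\tanh$. Dividing by mean decision time cancels the saturation and leaves a quantity linear in the reward difference.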
The paper shows that the response-time-augmented target effectively linearizes the information available to the in-context learner. Instead of forcing a linear decoder to invert a nonlinear compressed representation, the model receives a continuous magnitude-sensitive signal. Theorem 3 establishes convergence properties for the response-time-augmented training objective. Corollary 1 states that, under the paper’s assumptions, the trained transformer can correctly adapt in context to an arbitrary unseen human type as the number of demonstrations and annotator samples grows.
This is not a mystical claim that slow users are wise and fast users are decisive. It is narrower: in a modeled decision process, time-to-choice carries information about the strength of evidence behind the choice. For preference modeling, strength is exactly what binary labels tend to erase.
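The claim is easy to check by simulation. The sketch below runs a plain Euler–Maruyama drift–diffusion process with symmetric boundaries and compares two signals: the mean choice alone, and the mean choice divided by the mean decision time. All settings (boundary, step size, trial count, drift values) are hypothetical, and the time discretization slightly inflates the effective boundary, so the ratio lands a little below drift divided by boundary.

```python
import math
import random


def simulate_ddm(drift, boundary=1.0, dt=0.01, rng=random):
    """One simulated trial: returns (choice in {-1, +1}, decision time)."""
    x, t = 0.0, 0.0
    step_sd = math.sqrt(dt)
    while abs(x) < boundary:
        x += drift * dt + step_sd * rng.gauss(0.0, 1.0)
        t += dt
    return (1 if x > 0 else -1), t


def choice_and_time_signal(drift, n_trials=2000, seed=0):
    """Return (mean choice, mean choice / mean decision time) for one drift."""
    rng = random.Random(seed)
    z_sum = 0.0
    t_sum = 0.0
    for _ in range(n_trials):
        z, t = simulate_ddm(drift, rng=rng)
        z_sum += z
        t_sum += t
    mean_z = z_sum / n_trials
    mean_t = t_sum / n_trials
    return mean_z, mean_z / mean_t


drifts = (0.5, 1.0, 2.0, 4.0)  # stand-ins for reward differences
mean_zs = {}
ratios = {}
for v in drifts:
    mean_zs[v], ratios[v] = choice_and_time_signal(v)
    print(f"drift={v:.1f}  mean choice={mean_zs[v]:.3f}  choice/time={ratios[v]:.3f}")
```

The mean choice flattens as the drift grows, because a user who strongly prefers A and a user who overwhelmingly prefers A both pick A nearly every time. The time-normalized signal keeps climbing roughly in proportion to the drift.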
The synthetic results test the theorem under controlled shift
The synthetic experiment is the cleanest empirical validation because the researchers control the latent reward structure.
Human reward parameters are sampled from a mixture of two Gaussian distributions for training. For out-of-distribution evaluation, a new human type is sampled from a third disjoint Gaussian distribution. Response times are simulated from the drift–diffusion process. The authors test both a linear-attention model, matching the theoretical setup, and a 124M-parameter GPT-2 implementation, which checks whether the same qualitative pattern appears in a more standard transformer.
The result is not subtle:
| Setting | Linear attention accuracy | GPT-2 accuracy | Interpretation |
|---|---|---|---|
| Without response time, ID | 0.936 | 0.925 | Binary labels work well when preferences are familiar |
| Without response time, OOD | 0.783 | 0.694 | Accuracy drops sharply under unseen preference shift |
| With response time, ID | 0.878 | 0.905 | ID performance remains strong, though slightly below binary-only for both models |
| With response time, OOD | 0.891 | 0.875 | OOD performance improves substantially |
The most important row comparison is not simply “with response time beats without response time.” It is the OOD gap.
For GPT-2, binary-only performance falls from 0.925 in distribution to 0.694 out of distribution. With response time, the OOD result rises to 0.875. For the linear-attention model, response time moves OOD accuracy from 0.783 to 0.891, even while ID accuracy is lower than the binary-only ID result.
That last detail is useful. Response time is not presented as a universal accuracy steroid. In the synthetic table, binary-only feedback can look excellent when the test human comes from the same distribution as training. The value of response time appears most clearly when the model faces an unseen preference type. For business use, that means the feature is more about robustness and adaptation than about polishing already-familiar benchmark cases.
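The gap is simple arithmetic on the table above. A short sketch, with dictionary keys that are just labels chosen here:

```python
# Accuracies copied from the synthetic results table in this article.
synthetic = {
    ("linear_attention", "binary_only"): {"id": 0.936, "ood": 0.783},
    ("linear_attention", "with_rt"): {"id": 0.878, "ood": 0.891},
    ("gpt2", "binary_only"): {"id": 0.925, "ood": 0.694},
    ("gpt2", "with_rt"): {"id": 0.905, "ood": 0.875},
}


def ood_gap(acc):
    """ID accuracy minus OOD accuracy; positive means degradation under shift."""
    return round(acc["id"] - acc["ood"], 3)


gaps = {key: ood_gap(acc) for key, acc in synthetic.items()}
for (model, feedback), gap in gaps.items():
    print(f"{model:17s} {feedback:12s} ID-OOD gap = {gap:+.3f}")
```

Binary-only feedback loses 0.153 (linear attention) and 0.231 (GPT-2) when the human type is unseen. With response time the gap shrinks to -0.013 and 0.030 respectively, meaning OOD accuracy is roughly level with ID accuracy.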
The food-risk experiment is real behavioral evidence, not LLM deployment evidence
The real-world experiment uses a food-risk preference dataset from Smith and Krajbich. It contains binary choices and response times from 42 participants, with each participant answering between 60 and 200 queries. Each query presents two arms, and each arm contains two food items; selecting an arm gives one of those food items uniformly at random. Participants also rated food items on a discrete scale from -10 to 10 before the choice task, and the authors build feature vectors from those ratings plus second-order polynomial features.
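As a rough picture of what “ratings plus second-order polynomial features” can look like, here is one possible construction. It is a guess at the general shape, not the authors' pipeline: how the paper orders items within an arm, whether it adds a bias term, and how it scales ratings are not stated here, so the function names and the example ratings are hypothetical.

```python
from itertools import combinations_with_replacement


def second_order_features(ratings):
    """Ratings plus every degree-2 monomial (squares and pairwise products)."""
    linear = list(ratings)
    quadratic = [a * b for a, b in combinations_with_replacement(ratings, 2)]
    return linear + quadratic


def feature_difference(arm_a, arm_b):
    """Elementwise difference of the two arms' feature vectors."""
    fa = second_order_features(arm_a)
    fb = second_order_features(arm_b)
    return [x - y for x, y in zip(fa, fb)]


# Hypothetical pre-task ratings (scale -10..10) for the two items in each arm
arm_a = (6, -2)  # one liked item, one mildly disliked item: the risky arm
arm_b = (3, 3)   # two moderately liked items: the safe arm

phi_tilde = feature_difference(arm_a, arm_b)
print(phi_tilde)
```

The quadratic terms give a linear reward model some room to express risk attitudes, since a product or square of ratings distinguishes a spread-out arm from a safe one with a similar average.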
This experiment is best read as an applicability check using real human behavioral signals. It is not a direct RLHF annotation dataset for LLM outputs. That boundary matters. Food choices under controlled lab-style conditions are not the same as enterprise users judging legal summaries, code completions, or financial explanations.
Still, the pattern is directionally consistent with the paper’s mechanism. The authors train GPT-2 and evaluate performance under different inference lengths $M$, where $M$ is the number of in-context preference demonstrations available at test time.
| Setting | $M=4$ | $M=8$ | $M=16$ | Interpretation |
|---|---|---|---|---|
| Without response time, ID | 0.606 | 0.679 | 0.675 | More context helps, but performance plateaus unevenly |
| Without response time, OOD | 0.581 | 0.625 | 0.631 | OOD remains weaker than ID |
| With response time, ID | 0.633 | 0.681 | 0.710 | Response time improves ID at all shown lengths |
| With response time, OOD | 0.605 | 0.667 | 0.705 | The OOD gain becomes clearer as context length increases |
The paper’s Figure 1 visualizes the same food-risk experiment using distributions of inference accuracy. Its likely purpose is not to introduce a separate claim, but to make variability and the growing response-time advantage across $M$ easier to see. The table gives the means; the figure shows the spread.
The most business-relevant reading is cautious but positive. With $M=16$, OOD accuracy improves from 0.631 without response time to 0.705 with response time. That is a meaningful directional gain in a small real behavioral dataset. It does not prove that response time will transform every RLHF pipeline. It does show that the paper’s mechanism survives contact with human decision data beyond a synthetic sandbox.
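The same reading can be taken at every context length in the food-risk table. The sketch below only subtracts reported means; the dictionary layout is an arbitrary choice.

```python
# Mean accuracies copied from the food-risk table in this article.
food_risk = {
    "binary_only": {
        "id": {4: 0.606, 8: 0.679, 16: 0.675},
        "ood": {4: 0.581, 8: 0.625, 16: 0.631},
    },
    "with_rt": {
        "id": {4: 0.633, 8: 0.681, 16: 0.710},
        "ood": {4: 0.605, 8: 0.667, 16: 0.705},
    },
}


def rt_uplift(split):
    """Accuracy with response time minus accuracy without, per context length M."""
    with_rt = food_risk["with_rt"][split]
    without = food_risk["binary_only"][split]
    return {m: round(with_rt[m] - without[m], 3) for m in (4, 8, 16)}


ood_uplift = rt_uplift("ood")
id_uplift = rt_uplift("id")
print("OOD uplift by M:", ood_uplift)
print("ID uplift by M: ", id_uplift)
```

The OOD uplift is 0.024, 0.042, and 0.074 at $M = 4, 8, 16$, so it roughly triples as context grows. The ID uplift is smaller and uneven (0.027, 0.002, 0.035). These are differences of reported means only; the spread shown in the paper's figure is not captured here.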
The product lesson is richer feedback design, not just another model architecture
Many AI teams still treat feedback collection as an afterthought: a like button, a thumbs-down reason, maybe a short text box no one reads until a quarterly retrospective. The paper suggests that feedback interface design is part of the model architecture, whether product teams admit it or not.
If a system needs to adapt to unseen users or shifting preference domains, the feedback channel should capture more than a binary winner. Response time is one candidate signal. Others may include confidence ratings, graded preference strength, ranking among multiple options, edit distance between model output and user correction, time spent revising, abandonment behavior, or explicit uncertainty annotations.
Cognaptus inference, not the paper’s direct claim: the broader design principle is to collect feedback that distinguishes direction from intensity.
| Feedback signal | What it captures | Possible business use | Main risk |
|---|---|---|---|
| Binary choice | Which output wins | Simple reward modeling and A/B feedback | Loses preference strength |
| Response time | How easy or difficult the choice was | Personalization under preference shift | Noisy, privacy-sensitive, device-dependent |
| Confidence rating | User’s certainty about judgment | Weighting annotations and routing ambiguous cases | Adds friction and may be gamed |
| Edit behavior | How much the user had to repair output | Workflow-specific quality learning | Hard to normalize across tasks |
| Ranked alternatives | Relative ordering among several outputs | Richer reward learning | More costly annotation |
| Free-text critique | Reason for rejection or preference | Error taxonomy and policy refinement | Expensive to parse reliably |
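In practice, “distinguish direction from intensity” starts as a schema decision. Here is a minimal sketch of a feedback record that keeps the binary winner but leaves room for optional strength signals. Every field name is hypothetical and the list of signals is not exhaustive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PreferenceEvent:
    """One pairwise judgment plus optional intensity signals (hypothetical schema)."""
    user_id: str
    chosen: str                              # "a" or "b"
    response_time_ms: Optional[int] = None   # time from display to click
    confidence: Optional[float] = None       # self-reported, 0.0 to 1.0
    edit_distance: Optional[int] = None      # edits made to the chosen output

    def direction(self) -> int:
        """+1 if A won, -1 if B won: the part a thumbs-up already records."""
        return 1 if self.chosen == "a" else -1

    def has_intensity(self) -> bool:
        """True if at least one strength signal was captured."""
        signals = (self.response_time_ms, self.confidence, self.edit_distance)
        return any(value is not None for value in signals)


thin = PreferenceEvent(user_id="u1", chosen="a")
rich = PreferenceEvent(user_id="u2", chosen="b", response_time_ms=850, confidence=0.9)
print(thin.direction(), thin.has_intensity())
print(rich.direction(), rich.has_intensity())
```

Making the strength fields optional keeps the cheap path cheap. The point is that the storage layer should not be the reason the signal is missing.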
The obvious objection is that richer feedback costs more. Correct. But collecting cheap feedback that cannot support the adaptation you need is not efficiency. It is deferred measurement debt with a friendly UI.
The right question is not “Should every product collect response time?” The better question is: “Where does preference shift create enough business risk that richer feedback becomes worth the friction?”
For a low-stakes autocomplete tool, binary feedback may be good enough. For an AI assistant used across departments with different compliance thresholds, customer tone standards, or domain-specific risk preferences, binary feedback may be too blunt. The value of richer feedback is highest where unseen user types matter and retraining is slow, expensive, or politically messy.
What this means for RLHF and personalization teams
The paper points toward a practical architecture for adaptive preference systems:
- Keep binary feedback, but stop pretending it is the whole preference signal.
- Add auxiliary behavioral or explicit strength signals where preference heterogeneity matters.
- Train adaptation mechanisms that infer user- or group-level reward structure from in-context demonstrations.
- Evaluate not only in-distribution preference prediction, but also unseen-user and unseen-domain performance.
- Report the OOD gap, not just aggregate accuracy.
The last two points are where many product dashboards quietly fail. Aggregate satisfaction can look healthy while minority preference groups, new professional segments, or high-risk workflows are poorly modeled. A reward model can be “aligned” to the average and still be wrong for the user who matters in the current workflow. Alignment by average is a polite form of product amnesia.
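A toy calculation shows how an average hides a segment. The segment names and counts below are invented purely to illustrate the reporting pattern.

```python
from collections import defaultdict


def accuracy_by_segment(records):
    """records: iterable of (segment, correct) pairs, where correct is a bool."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for segment, correct in records:
        totals[segment] += 1
        hits[segment] += int(correct)
    return {segment: hits[segment] / totals[segment] for segment in totals}


# Invented evaluation log: a large familiar segment and a small new one
records = (
    [("support", True)] * 90 + [("support", False)] * 10
    + [("compliance", True)] * 4 + [("compliance", False)] * 6
)

overall = sum(correct for _, correct in records) / len(records)
per_segment = accuracy_by_segment(records)

print(f"overall accuracy: {overall:.3f}")
for segment, acc in per_segment.items():
    print(f"  {segment:11s} {acc:.3f}")
```

The overall figure sits near 0.85 while the small segment is predicted correctly only 40% of the time. A dashboard that shows one number would call this healthy.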
The paper also gives a useful diagnostic distinction:
| If the system fails under new users… | The default explanation | The paper’s alternative diagnosis |
|---|---|---|
| Predictions degrade for unfamiliar preferences | Need a larger reward model | The feedback signal may be insufficient |
| More binary labels do not fix the issue | Need more annotation volume | More labels may only refine a compressed representation |
| Multi-reward models work for known segments but not new ones | Need more predefined clusters | The system may need in-context adaptation to continuous preference variation |
| Personalization requires retraining | Need better fine-tuning operations | The reward model may need richer demonstrations at inference time |
This is a useful shift for business planning. It moves the investment question from “How many reward models should we train?” to “What feedback schema lets the system infer new preferences without retraining?”
Boundaries: the paper is strong, but not a deployment manual
The paper’s limitations are not decorative, and they should not be hidden in the footnotes like an embarrassing family member.
First, the theory relies on a simplified setting: linear rewards, known feature representations, Bradley–Terry preferences, and a linear-attention transformer. These assumptions are reasonable for theory, but production LLM alignment involves richer outputs, latent features, noisy annotations, strategic users, multi-turn context, and organizational constraints.
Second, response time is informative only if measured carefully. In real products, time-to-choice may reflect hesitation, confusion, distraction, reading speed, network latency, device type, interface layout, accessibility needs, or whether the user’s cat has decided the keyboard is a pillow. Treating response time as pure preference strength would be naive. The signal needs cleaning, calibration, and probably task-specific normalization.
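A first pass at that cleaning might look like the sketch below: discard implausible timings, log-transform the rest, and standardize within a single task. The thresholds are placeholders rather than recommendations, and a real pipeline would also need per-user and per-device calibration that is not attempted here.

```python
import math
import statistics


def normalize_response_times(times_ms, min_ms=200, max_ms=60_000):
    """Drop implausible timings, log-transform, then z-score within one task.

    min_ms and max_ms are hypothetical cutoffs for accidental clicks and
    walked-away-from-the-screen sessions.
    """
    kept = [t for t in times_ms if min_ms <= t <= max_ms]
    if len(kept) < 2:
        return []
    logs = [math.log(t) for t in kept]
    mu = statistics.fmean(logs)
    sd = statistics.stdev(logs)
    if sd == 0:
        return [0.0] * len(logs)
    return [(x - mu) / sd for x in logs]


# Invented timings: one accidental click, four plausible reads, one abandoned tab
raw = [50, 900, 1200, 1500, 4000, 600_000]
z_scores = normalize_response_times(raw)
print([round(z, 2) for z in z_scores])
```

The log transform matters because response times are typically right-skewed. Without it, a handful of slow trials dominates the scale and the fast, confident decisions become indistinguishable.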
Third, response time may create privacy and governance concerns. Behavioral telemetry can be sensitive. Some users may not expect timing data to influence model training or personalization. If a company uses it, it should be explicit about collection, retention, and purpose.
Fourth, the food-risk dataset supports applicability to real human decisions, not direct transfer to RLHF annotation for language models. The paper’s GPT-2 experiments are useful as architecture checks, but they do not replace experiments on actual LLM preference data with response-time instrumentation.
So the correct business conclusion is bounded: response-time-like signals are promising candidates for improving adaptive reward modeling under preference shift. They are not magic dust sprinkled over a thumbs-up button.
The quiet punchline: feedback quality is alignment infrastructure
The most useful idea in this paper is not that response time is the next great alignment metric. It is that the feedback channel determines what the reward model can possibly learn.
Binary preferences are attractive because they simplify annotation. But simplification is not free. It decides which information survives into the training process and which information is discarded before learning even begins. Once the preference strength is gone, a larger model may only become more sophisticated at decoding an impoverished signal.
For companies building AI assistants, evaluators, copilots, or domain-specific agents, the practical takeaway is clear: treat feedback design as a first-class systems problem. Ask what kind of preference variation the product must handle. Ask whether the collected signal identifies that variation. Ask whether OOD adaptation is being measured separately from ordinary benchmark accuracy.
The lesson is slightly uncomfortable, which is usually a sign that it is useful. Alignment is not only about better optimization after feedback is collected. It is also about not throwing away the part of human judgment that the system later needs to adapt.
A thumbs-up may tell the model who won. It may not tell the model why the win mattered, how much it mattered, or whether a new user would have judged the same pair differently. For robust preference modeling, that missing distance is not a rounding error. It is the plot.
Cognaptus: Automate the Present, Incubate the Future.
[1] Zhenyu Sun, Zheng Xu, and Ermin Wei, “In-Context Reward Adaptation for Robust Preference Modeling,” arXiv:2605.30323, 2026. https://arxiv.org/abs/2605.30323