Preference Signals, Not Preference Theater
Businesses are currently learning an expensive lesson: user behavior is not the same thing as user preference.
A person clicks because the button was large. A driver brakes because the situation was unclear. A customer accepts a chatbot answer because the refund is small and arguing is tedious. A manager approves a workflow because the dashboard made the alternative invisible. The log file looks objective. It is also quietly contaminated by habit, uncertainty, exploration, friction, fatigue, and the occasional human desire to end the meeting before lunch.
This matters because preference alignment is moving from research vocabulary into operational design. AI systems are no longer only predicting the next word or classifying an image. They are being asked to recommend, route, negotiate, schedule, prioritize, drive, and decide. Once a system starts choosing among acceptable options, the question changes from “Can it imitate what happened?” to “Does it understand what should be preferred?”
Two recent arXiv papers make this distinction unusually clear from opposite ends of the problem. “Learning the Preferences of a Learning Agent” gives the theoretical warning: if the observed person or agent is still learning, behavior alone may not reveal the full structure of their preferences.[1] “VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving” gives the engineering response: instead of relying only on imitation, use a vision-language model as an offline critic to generate trajectory-level preference comparisons, then train a specialized motion forecasting model with Direct Preference Optimization.[2]
The tempting summary here is “Paper A says X; Paper B says Y.” That would be tidy, harmless, and not very useful. The stronger reading is a logic chain:
- Preference learning is an information problem.
- Behavior logs may not contain enough information.
- Comparative feedback can expose what imitation hides.
- Proxy critics, including VLMs, can scale comparison generation.
- But the proxy is not magic; it simply moves the validation problem upstream.
That is the whole story. The rest is detail, and detail is where most AI products quietly lose money.
The mistake: treating behavior as preference
A common business assumption is that observed behavior reveals preference. If users repeatedly choose option A, then they must prefer A. If an expert agent takes action B, then B is the target. If a driver’s recorded trajectory curves in a certain way, then the model should learn that curve.
This assumption is often tolerable when the environment is simple and the actor is competent, stable, and well-informed. It becomes fragile when the actor is still learning.
The Berkeley paper formalizes this issue by studying a predictor that observes a learner acting online and tries to infer the learner’s underlying reward function. The learner is not assumed to be already optimal. It may be a no-regret learner, or it may behave according to a Boltzmann-rational policy that gradually improves as its internal estimate of value improves.
The important distinction is not mathematical decoration. It is the difference between two questions:
| Question | What it asks | Why it matters |
|---|---|---|
| Best-response recovery | Can we identify the action that looks optimal? | Useful for copying what seems best now. |
| Reward-structure recovery | Can we infer how alternatives are ranked and valued? | Necessary for generalizing across new situations. |
The paper’s warning is sharp: under a no-regret learner model, the best we can generally hope to recover is the preferred action in a state, not the full preference structure behind it. In plain English, watching someone eventually choose well does not tell us how they rank the options they did not choose.
This is irritating for anyone who has built a product analytics dashboard and called it “user insight.” Logs can show what happened. They do not automatically explain the latent preference landscape.
The paper also gives a more hopeful but conditional result. Under a Boltzmann-rational learner model, where actions reveal graded information through stochastic choice probabilities, stronger recovery guarantees become possible under specific assumptions. The key phrase is “under specific assumptions.” In the paper’s own conclusion, the authors note that choosing the learner model and the evaluation measure is fundamental, and that Boltzmann rationality itself is not a complete model of bounded human rationality.
That is the first half of the chain: behavior is a narrow supervision channel. Sometimes it is enough to recover what action looks best. Often it is not enough to recover why.
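A small sketch makes the contrast concrete. It is an illustration under loud assumptions, not the paper's construction: a single state, three actions with made-up reward values, a fixed (non-learning) Boltzmann policy whose temperature parameter `beta` is known, and exact choice probabilities rather than estimates from finite data. Under those assumptions, log-probability ratios recover reward gaps, while a policy that always plays its top action reveals only which action is on top.

```python
import math

def boltzmann_policy(rewards, beta):
    """Softmax choice probabilities: p(a) is proportional to exp(beta * r(a))."""
    top = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp(beta * (r - top)) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

def reward_gaps(probs, beta):
    """Recover r(a) - r(best) from choice probabilities, assuming beta is known."""
    best = max(probs)
    return [math.log(p / best) / beta for p in probs]

rewards = [1.0, 0.8, -2.0]  # hypothetical values, chosen for illustration
beta = 2.0                  # assumed known; in practice it usually is not

probs = boltzmann_policy(rewards, beta)
gaps = reward_gaps(probs, beta)  # approximately [0.0, -0.2, -3.0]

# A policy that always plays the top action shows the argmax and nothing else:
# it cannot distinguish "0.8 vs -2.0" from any other ordering of the losers.
best_index = rewards.index(max(rewards))
greedy = [1.0 if i == best_index else 0.0 for i in range(len(rewards))]
```

The graded probabilities say that the second action is a near miss and the third is a bad idea. The greedy record says neither, which is the gap between best-response recovery and reward-structure recovery.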
The engineering answer: make the signal richer
The VL-DPO paper starts in a very different place: autonomous driving.
The authors focus on motion forecasting for an ego vehicle. Standard imitation learning trains a model to reproduce expert-demonstrated trajectories, usually through objectives that reward local geometric accuracy. This is useful, but it does not necessarily capture human driving preferences such as safe spacing, smoothness, courtesy, situational confidence, and avoiding aggressive maneuvers.
Their solution is not to put a giant vision-language model inside the car and hope it reasons gracefully at highway speed. That would be dramatic. It would also be a latency and reliability headache wearing a lab coat.
Instead, they use a modular architecture:
| Component | Role in the system |
|---|---|
| Pretrained motion forecasting model | Generates candidate ego-vehicle trajectories. |
| Frozen VLM | Acts offline as a zero-shot driving reasoner and preference annotator. |
| BEV and camera context | Give the VLM spatial, visual, temporal, and route information. |
| Preference-pair construction | Converts the VLM’s selected trajectory into comparisons against unselected candidates. |
| DPO finetuning | Trains the motion model to prefer the VLM-selected trajectory over alternatives. |
| Runtime deployment | Uses the specialized motion model, not the VLM critic. |
This is the practical leap. The paper does not merely ask the model to imitate one demonstrated trajectory. It samples 12 candidate trajectories, asks the VLM to choose the preferred one using scene context, then constructs up to 11 preference pairs per scene: the selected trajectory versus each unselected trajectory.
That matters because comparison teaches what imitation usually omits. Imitation says, “Do this.” Preference comparison says, “Do this rather than those, and here are several examples of what not to do.” For alignment, the second signal is often more valuable.
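The pair construction described above can be sketched in a few lines. This is a hedged illustration, not the authors' code: `PreferencePair`, `build_preference_pairs`, the scene identifier, and the selected index are all hypothetical names and values. Any filtering the paper applies (the text says "up to 11" pairs) is not modeled here.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class PreferencePair:
    scene_id: str
    chosen: int    # index of the critic-selected candidate trajectory
    rejected: int  # index of one unselected candidate trajectory

def build_preference_pairs(scene_id: str, num_candidates: int,
                           selected_index: int) -> List[PreferencePair]:
    """Pair the selected candidate against every unselected candidate."""
    if not 0 <= selected_index < num_candidates:
        raise ValueError("selected_index must point at one of the candidates")
    return [
        PreferencePair(scene_id, selected_index, other)
        for other in range(num_candidates)
        if other != selected_index
    ]

# 12 sampled candidates, with the critic picking candidate 4 (made-up choice)
pairs = build_preference_pairs("scene-0001", num_candidates=12, selected_index=4)
```

One critic decision per scene becomes eleven training comparisons, which is where the extra supervision comes from.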
The paper’s results support this hierarchy of signals. On the Waymo Open End-to-End Driving Dataset, the authors report that the VLM-selected trajectory slightly improves Rater Feedback Score and substantially reduces Average Displacement Error compared with the model’s own most-likely rollout. More importantly, after finetuning, the IL+VL-DPO model achieves the highest RFS among the tested variants, improving RFS by 11.94% over the MotionLM baseline and by 4.07% over imitation learning, while reducing ADE by 10.01% and 6.5% respectively.
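For readers who have not met ADE: Average Displacement Error is the mean Euclidean distance between predicted and ground-truth waypoints. A minimal sketch follows, assuming 2D waypoints sampled at matching timesteps. The coordinates are made up, and the benchmark's exact horizons and averaging protocol are not reproduced here.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]

def average_displacement_error(predicted: Sequence[Point],
                               ground_truth: Sequence[Point]) -> float:
    """Mean Euclidean distance between matching waypoints of two trajectories."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and equally long")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)

# Hypothetical waypoints: the prediction drifts 0, 1, and 2 units from the truth
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ground_truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade = average_displacement_error(predicted, ground_truth)
```

ADE is purely geometric. It says how close a trajectory is to the recorded one, not whether a human rater would have been comfortable riding in it, which is why the article treats RFS and ADE as separate questions.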
The ablation is more interesting than the headline number. High-level action supervision helps when used as input, especially for geometric accuracy. But the strongest preference alignment comes from trajectory-level comparative feedback. The paper’s own framing is useful: high-level actions answer “what maneuver should be executed,” while preference pairs answer “how should it be executed safely and comfortably.”
That is the second half of the chain: if behavior logs do not reveal enough preference structure, create a richer comparison channel.
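For orientation, here is the standard DPO objective for a single preference pair, written as plain Python. The formula is the published DPO loss; everything else is an assumption for illustration. The log-probabilities are made-up numbers, `beta=0.1` is an arbitrary choice, and how VL-DPO scores a trajectory under the motion model is not described in this article.

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * (chosen gain - rejected gain))).

    Each "gain" is how much the trained policy has raised an option's
    log-probability relative to the frozen reference model.
    """
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) equals softplus(-margin); branch to avoid overflow
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities for one (chosen, rejected) trajectory pair
unchanged = dpo_loss(-5.0, -6.0, -5.0, -6.0)  # policy identical to reference
improved = dpo_loss(-4.0, -7.0, -5.0, -6.0)   # chosen up, rejected down
worsened = dpo_loss(-7.0, -4.0, -5.0, -6.0)   # the opposite movement
```

The loss only falls when the model shifts probability toward the chosen option and away from the rejected one, relative to where it started. It rewards a relative ordering, not a match to one demonstrated answer.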
The combined lesson: alignment is constrained by signal content
The two papers are complementary because one explains the limit and the other demonstrates a workaround.
The theory paper says: do not assume observed behavior contains the full preference structure, especially when the actor is learning. The driving paper says: then do not rely only on observed behavior; generate explicit comparisons among plausible alternatives.
Put together, they suggest a clean design principle:
Preference alignment improves when the supervision channel reveals meaningful distinctions among alternatives, not merely when it records more examples of chosen behavior.
This is not a slogan. It is a product architecture principle.
Consider four possible supervision channels:
| Supervision channel | What it captures | What it misses | Typical business risk |
|---|---|---|---|
| Raw behavior logs | What users did | Whether they endorsed it | Optimizing for friction, habit, or manipulation |
| Imitation learning | A demonstrated positive example | Why alternatives are worse | Brittle replication under new conditions |
| Human preference labels | Direct comparative judgment | Broad coverage: labels are costly, sparse, and sometimes noisy | Limited coverage and slow iteration |
| Proxy-generated comparisons | Scalable preference structure | Proxy misalignment | False confidence if not validated |
The point is not that proxy-generated comparisons are always better. They are not. A VLM critic can be wrong, biased, overconfident, or blind to domain-specific constraints. The point is narrower and more useful: comparison data has higher alignment bandwidth than isolated demonstrations, provided the comparison source is validated.
That small clause—“provided the comparison source is validated”—is where serious systems are separated from PowerPoint systems.
What the papers show, and what business should infer
The papers do not prove that VLMs understand human values. They do not prove that synthetic preference labels can replace human judgment in every domain. They certainly do not prove that every workflow can be aligned by sprinkling DPO over it like machine-learning parmesan.
What they do show is more specific and more useful.
| Layer | What the papers show | Business interpretation |
|---|---|---|
| Theory | Preference recovery depends on assumptions about the learner and the evaluation metric. | Before learning from behavior, define what kind of preference knowledge you actually need. |
| Data | Behavior alone may identify preferred actions without revealing full preference structure. | Logs should be treated as evidence, not truth. |
| Engineering | VLM-generated trajectory comparisons can improve preference alignment in an autonomous-driving motion model. | Proxy critics can be useful offline signal generators when direct human labels are sparse. |
| Architecture | A frozen VLM can supervise a smaller specialized model without being deployed at inference time. | Expensive reasoning models may be more valuable as training-time auditors than runtime engines. |
| Governance | Proxy feedback must be checked against human preferences and domain metrics. | Alignment pipelines need validation loops, not just larger annotation pipelines. |
For business systems, the implication is straightforward: stop asking only “Do we have enough data?” Ask “Does this data contain the distinctions the model needs to learn?”
A million support tickets may teach a chatbot how agents historically responded. They may not teach when a customer would prefer a refund, a credit, an apology, or escalation. Procurement logs may teach which suppliers were chosen. They may not reveal which alternatives were rejected because of delivery risk, hidden relationship constraints, or a manager’s outdated spreadsheet superstition. Recommendation clicks may reveal curiosity, not satisfaction.
The model does not know the difference unless the supervision signal exposes it.
A practical framework: the preference signal audit
A business team planning preference-aligned AI should run a preference signal audit before choosing the training method. The audit is simple, but not comfortable, which is probably why it is useful.
| Audit question | Bad answer | Better answer |
|---|---|---|
| What decision is the AI making? | “It recommends things.” | “It ranks options under constraints A, B, and C.” |
| What does the behavior log actually record? | “User preference.” | “Observed action under interface, time, and information constraints.” |
| Are users or experts still learning? | “Probably, but logs are logs.” | “Yes, so early behavior may reflect exploration or confusion.” |
| Do we need best-action imitation or full preference ranking? | “Whichever improves accuracy.” | “The deployment setting requires ranking acceptable alternatives.” |
| Can we generate comparisons? | “Human labels are too expensive.” | “Use human labels for calibration and proxy critics for scale.” |
| How will proxy feedback be validated? | “The model seems reasonable.” | “Benchmark proxy choices against human raters and downstream business metrics.” |
This framework also clarifies when DPO-like methods are relevant. DPO is not a magic alignment button. It is useful when the organization can construct meaningful preference pairs: chosen versus rejected, acceptable versus risky, better escalation versus worse escalation, safe trajectory versus technically possible but uncomfortable trajectory.
Without meaningful comparisons, preference optimization becomes ceremony. The model is not being aligned; it is being given a more fashionable loss function.
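The last audit row, benchmarking proxy choices against human raters, can start as something very plain. The sketch below is one possible check, not a method from either paper: it assumes a calibration set where humans and the proxy each picked one option per item, and every name and value in it is hypothetical.

```python
from typing import Dict

def proxy_agreement_rate(human_choice: Dict[str, int],
                         proxy_choice: Dict[str, int]) -> float:
    """Share of jointly labeled items where the proxy picked the human's option."""
    shared = human_choice.keys() & proxy_choice.keys()
    if not shared:
        raise ValueError("no items were labeled by both humans and the proxy")
    matches = sum(1 for item in shared if human_choice[item] == proxy_choice[item])
    return matches / len(shared)

# Hypothetical calibration labels: item id -> index of the preferred option
human = {"s1": 3, "s2": 0, "s3": 7, "s4": 2}
proxy = {"s1": 3, "s2": 1, "s3": 7, "s4": 2, "s5": 5}  # "s5" has no human label

rate = proxy_agreement_rate(human, proxy)  # 3 matches out of 4 shared items
```

Raw agreement is only a first check. It does not correct for chance agreement, and it says nothing about whether the disagreements cluster in the cases that matter most. What rate counts as acceptable is a business decision the sketch deliberately leaves open.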
The autonomy lesson: do not deploy the philosopher in the engine room
One of the more business-relevant ideas in VL-DPO is architectural, not algorithmic.
The VLM is used offline. It reasons over camera context, BEV visualizations, route information, speed, and candidate trajectories. It generates supervision. Then the specialized motion forecasting model is finetuned and deployed without requiring the VLM at inference time.
This design deserves attention beyond autonomous driving.
Many companies are currently tempted to put the largest available model directly into every operational loop. That can work for low-risk, low-frequency decisions. But for high-volume or latency-sensitive systems, a better pattern is often:
- Use a powerful model as a teacher, critic, evaluator, or preference annotator.
- Convert its judgments into structured training data.
- Train or tune a smaller task model.
- Deploy the smaller model.
- Audit the deployed model against human and business metrics.
This is not glamorous. It is, however, cheaper, faster, and easier to govern. A rare combination. We should enjoy it while it lasts.
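The first four steps of that pattern fit in a toy sketch. Everything in it is a stand-in chosen for illustration: `propose` and `critic` are hypothetical callables, the "student" is a crude win-count ranking that ignores context, and the support scenarios are invented. The one point it is meant to show is structural: the critic is called only while building the dataset, never at decision time.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str, str]  # (context, chosen, rejected)

def collect_pairs(contexts: Sequence[str],
                  propose: Callable[[str], List[str]],
                  critic: Callable[[str, List[str]], str]) -> List[Pair]:
    """Offline step: ask the expensive critic once per context, emit pairs."""
    pairs: List[Pair] = []
    for ctx in contexts:
        candidates = propose(ctx)
        chosen = critic(ctx, candidates)
        pairs.extend((ctx, chosen, other) for other in candidates if other != chosen)
    return pairs

def fit_win_counts(pairs: Sequence[Pair]) -> Counter:
    """Toy 'student': +1 for each win, -1 for each loss, context ignored."""
    wins: Counter = Counter()
    for _, chosen, rejected in pairs:
        wins[chosen] += 1
        wins[rejected] -= 1
    return wins

def student_rank(wins: Counter, candidates: Sequence[str]) -> List[str]:
    """Runtime step: rank candidates using the student only, no critic call."""
    return sorted(candidates, key=lambda c: wins[c], reverse=True)

# Hypothetical stand-ins for a candidate generator and a large-model critic
options = ["refund", "credit", "escalate"]
propose = lambda ctx: list(options)
critic = lambda ctx, cands: "escalate" if "legal" in ctx else "credit"

pairs = collect_pairs(["late delivery", "legal threat", "damaged item"], propose, critic)
ranked = student_rank(fit_win_counts(pairs), options)
```

A real student would be a trained model conditioned on context, and the fifth step, auditing against human and business metrics, still has to happen outside this loop.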
The same pattern can apply outside driving:
| Domain | Candidate outputs | Preference critic | Deployable model |
|---|---|---|---|
| Customer service | Possible replies or escalation paths | Human reviewers plus LLM critic | Smaller support-routing model |
| Procurement | Supplier shortlists | Policy-aware evaluator plus buyer review | Vendor-ranking model |
| Finance operations | Exception-handling actions | Compliance critic plus analyst labels | Workflow decision model |
| Sales operations | Lead-prioritization sequences | Manager review plus model critique | Lead-scoring model |
| Internal knowledge work | Draft answers or research paths | Expert reviewer plus LLM comparison | Retrieval and response policy |
The important phrase is not “LLM critic.” It is “plus validation.” A proxy critic can expand coverage, but it should not be allowed to quietly become the constitution.
Where the tension remains
The two papers also expose a tension that should not be smoothed away.
The theory paper is cautious about what can be inferred from behavior. Its central warning is epistemic: unless the learner model and evaluation metric justify the inference, preference recovery may be underidentified.
The driving paper is optimistic about a practical workaround. It shows that a VLM can create useful preference comparisons in a specific autonomous-driving setup and that those comparisons can improve a motion model.
These are not contradictory. They simply sit at different levels.
The theoretical paper asks: under what assumptions is preference recovery possible?
The engineering paper asks: can we build a richer supervision channel that works empirically in this domain?
A mature business reading needs both. Theory prevents overclaiming. Engineering prevents paralysis.
The danger is to read the driving result as “VLMs can replace human preference labels.” The paper does not justify that general claim. In its setting, VLM-generated comparisons were useful and even outperformed DPO trained on a limited set of human preference labels on RFS, while that human-preference DPO variant achieved lower ADE in the reported ablation. That is a domain-specific result shaped by dataset design, metric definitions, candidate generation, the VLM prompt, and the number of comparisons available per scene.
The better conclusion is more disciplined: when human labels are sparse, structured proxy comparisons may provide a richer training signal than a small set of direct preference pairs, but only after calibration against human judgment and downstream task metrics.
That sentence is less viral than “VLMs replace annotators.” It is also less wrong.
The measurable business value: less imitation, more preference design
For managers, the value of this paper cluster is not that it introduces another acronym. The value is that it gives a practical distinction between three levels of AI improvement:
| Level | Goal | Typical method | Failure mode |
|---|---|---|---|
| Prediction | Match observed outcomes | Supervised learning | Learns historical artifacts |
| Imitation | Copy demonstrated good behavior | Fine-tuning on expert examples | Misses why alternatives are worse |
| Preference alignment | Rank and choose among acceptable alternatives | Comparative feedback and preference optimization | Depends on quality of preference signal |
Most enterprise AI projects are still stuck between prediction and imitation. They learn what historically happened, then hope that historical behavior equals desired behavior. Sometimes it does. Often it equals “what the organization tolerated under old constraints.”
Preference design asks a different set of questions:
- What alternatives should the model compare?
- Who or what judges the comparison?
- What dimensions define “better”?
- Which judgments require humans?
- Which judgments can be scaled by a proxy?
- How do we detect proxy drift?
- How do we measure downstream harm, not just offline score improvement?
These questions are less convenient than “upload the logs and fine-tune.” They are also closer to how real automation fails.
A model that learns to answer tickets faster but escalates fewer borderline cases may look efficient until customer churn increases. A model that learns to rank suppliers by historical acceptance may reinforce a procurement bottleneck. A driving model that optimizes geometric accuracy may still feel uncomfortable or unsafe to human raters. Local accuracy is not the same as aligned behavior.
The quiet strategic implication
The strongest strategic implication is that preference data should be designed, not merely collected.
Many firms already have logs. Fewer have structured comparisons. Even fewer have calibrated proxy critics that can generate comparisons at scale and be checked against human raters. That gap is an opportunity.
A practical AI roadmap should separate three assets:
- Behavioral records — what happened.
- Preference comparisons — what was better and worse among alternatives.
- Validation standards — how the organization knows the preference signal is trustworthy.
The third asset is the least glamorous and the most defensible. Anyone can collect logs. Many can prompt a model. Fewer can design a preference validation loop that survives contact with operations, regulation, and unhappy customers.
This is where Cognaptus-style automation should be careful. The value is not in saying “AI understands preferences.” It does not, at least not in the casual magical sense. The value is in building workflows where preference information is made explicit enough for models to learn from and constrained enough for businesses to govern.
Conclusion: preference alignment is an information-quality problem
The two papers together point to a sober conclusion.
The Berkeley paper shows why observed behavior may be insufficient: if the actor is still learning, behavior may reveal the best-looking action without revealing the underlying preference structure. The VL-DPO paper shows a concrete engineering pattern for enriching the supervision channel: use a frozen VLM offline to create structured trajectory comparisons, then use DPO to train a specialized model.
The combined message is not “more data.” It is “better preference signal.”
For businesses, that changes the project plan. Do not begin with the model. Begin with the decision. Then identify the alternatives, the preference dimensions, the comparison source, the validation method, and only then the optimization technique.
Otherwise, the organization may build a very sophisticated system that faithfully imitates yesterday’s compromises. And because it has a transformer in the loop, everyone will call it innovation.
Cognaptus: Automate the Present, Incubate the Future.
[1] Karim Abdel Sadek, Mark Bedaywi, Rhys Gould, and Stuart Russell, “Learning the Preferences of a Learning Agent,” arXiv:2605.09217v1, 09 May 2026. https://arxiv.org/abs/2605.09217
[2] Zhefan Xu, Ghassen Jerfel, Marina Haliem, Qi Zhao, Jeonhyung Kang, and Khaled S. Refaat, “VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving,” arXiv:2605.20082v1, 19 May 2026. https://arxiv.org/abs/2605.20082