A customer tells your AI assistant that she dislikes crowded tourist attractions. Three weeks later, she asks for a weekend itinerary.
A good assistant should not proudly recommend the busiest landmark in the city.
A less good assistant will do exactly that, but in a warm tone.
This is the quiet failure mode behind many “personal AI” demos. The interface remembers the conversation. The product claims continuity. The model may even have a giant context window large enough to swallow a small novel. Yet when the user asks a new question, the system behaves as if the earlier preference is just decorative text floating somewhere in the attic.
The paper behind RealPref, “Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions,” gives this problem a more disciplined test [1]. Its main contribution is not simply another benchmark. The useful part is sharper: it separates “the model saw the preference” from “the model can detect, retain, retrieve, apply, and generalize that preference when it matters.”
That distinction is where many assistant products quietly lose the plot.
The easy misconception: put more history in the prompt
The most tempting diagnosis is also the laziest one: LLMs forget because we do not give them enough memory.
So the obvious product fix becomes: store more conversation history, extend the context window, maybe attach a vector database, and call the result personalization. Investors nod. Users get a settings page. Everyone pretends memory has been solved.
RealPref makes that view look incomplete.
The benchmark is built around long-horizon user–LLM interaction histories. It contains 100 synthetic user profiles, 1,300 preference-query pairs, four types of preference expression, and contexts ranging from short histories around 2K tokens to extreme histories around 247K tokens. The important design choice is that preferences are not always handed to the model as neat instructions. They may appear directly, indirectly, stylistically, or through experience feedback spread across sessions.
That matters because real users rarely behave like benchmark designers.
They do not always say, “I prefer boutique hotels over chain hotels because I value local authenticity.” Sometimes they say they hated their last generic hotel, enjoyed talking with a family-run host, found the neighborhood more memorable than the amenities, and later ask where to stay in Japan. A human travel adviser can connect the pattern. A model may see all the words and still miss the preference.
So the RealPref question is not: can the model read a long transcript?
It is: can the model convert scattered user history into useful preference-aware behavior at the right moment?
Those are very different capabilities. The first is context ingestion. The second is personalization.
RealPref tests the preference lifecycle, not just recall
The benchmark’s structure is useful because it turns personalization into a sequence of operational failures.
A personalized assistant must do at least five things:
| Capability | What the assistant must do | Typical failure |
|---|---|---|
| Detection | Notice that a preference exists | Treat preference signals as casual conversation |
| Interpretation | Understand what the preference means | Over-literal reading or missing the reason behind the preference |
| Retrieval | Bring the relevant signal back when needed | Lose the preference inside long irrelevant context |
| Application | Shape the answer around the preference | Give a generic answer with no adaptation |
| Generalization | Extend the preference carefully to related cases | Either fail to infer or infer too aggressively |
RealPref’s dataset design maps onto this lifecycle.
Each synthetic user has a profile, demographic information, a biography, life events, and preference-related conversations. The original preferences are generated to be diverse, persona-related, unique, value-free, and complete. That “value-free” requirement is important. If every preference is socially obvious—healthy food, clean hotels, responsible spending—the model can pass by guessing what a sensible person might want. Real personalization starts where common sense is no longer enough.
The paper also distinguishes original preferences from generalized preferences. Original preferences are expressed in the dialogue. Generalized preferences are not directly stated but can be rationally inferred from the reasons behind earlier preferences. For example, someone who prefers unconventional food, boutique accommodations, and experimental makeup may also prefer independent cinema over formulaic blockbusters. That is not a memorization test. It is a controlled test of preference reasoning.
A production assistant faces this constantly. A user who dislikes crowded travel spots may also prefer quieter restaurants, off-peak scheduling, and small-group tours. But the assistant must infer this carefully. Too little generalization makes the system forgetful. Too much generalization makes it creepy. The line between “thoughtful” and “why are you watching me?” is thinner than product copy usually admits.
Four preference expressions expose where models stop being attentive
RealPref uses four expression types, moving from easy to difficult:
| Preference expression type | What it looks like | Main capability tested |
|---|---|---|
| Explicit direct statement | “I prefer X over Y.” | Basic extraction |
| Explicit contextualized mention | Preference appears naturally inside a broader conversation | Extraction under distraction |
| Implicit stylistic expression | Preference is implied through contrast, emotion, or rhetorical cues | Pragmatic interpretation |
| Implicit experience feedback | Preference emerges across multiple sessions and experiences | Temporal integration |
This is the mechanism-first heart of the paper.
Direct statements are close to ordinary instruction following. The preference sits in the context as a clean, self-contained statement, so the model can often attend to it and use it. Contextualized mention is harder because the preference is surrounded by other material. Stylistic expression is harder again because the model must infer attitude without a clean “I prefer” marker. Experience feedback is the most realistic and the most painful: the preference is distributed across time.
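One way to keep these four expression types visible inside a product is to tag every captured signal with how it was expressed. The sketch below is a minimal illustration, not the paper's data format: `ExpressionType`, `PreferenceSignal`, and `sessions_spanned` are hypothetical names, and the example signals paraphrase the hotel story above.

```python
from dataclasses import dataclass
from enum import Enum


class ExpressionType(Enum):
    """The four expression types, from easiest to hardest."""
    EXPLICIT_DIRECT = "explicit_direct"                  # "I prefer X over Y."
    EXPLICIT_CONTEXTUALIZED = "explicit_contextualized"  # stated inside a broader conversation
    IMPLICIT_STYLISTIC = "implicit_stylistic"            # implied through contrast or emotion
    IMPLICIT_EXPERIENCE = "implicit_experience"          # emerges from feedback across sessions


@dataclass(frozen=True)
class PreferenceSignal:
    """One piece of evidence for a preference (hypothetical structure)."""
    session_id: int
    text: str
    expression: ExpressionType

    @property
    def is_implicit(self) -> bool:
        return self.expression in (
            ExpressionType.IMPLICIT_STYLISTIC,
            ExpressionType.IMPLICIT_EXPERIENCE,
        )


def sessions_spanned(signals: list[PreferenceSignal]) -> int:
    """Count distinct sessions that contribute evidence for one preference."""
    return len({s.session_id for s in signals})


# Illustrative signals: no single turn says "I prefer boutique hotels".
hotel_signals = [
    PreferenceSignal(3, "My last chain hotel felt generic.", ExpressionType.IMPLICIT_EXPERIENCE),
    PreferenceSignal(7, "The family-run host was the best part of the trip.", ExpressionType.IMPLICIT_EXPERIENCE),
]
```

Tagging signals this way makes it possible to report accuracy per expression type, which is where the explicit-to-implicit decline would show up in a team's own evaluation.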
The paper reports a clear decline from explicit to implicit expression types across the open-ended evaluation dimensions. The decline is especially meaningful because open-ended answers require the model to proactively produce a preference-aligned response, not merely pick the least-wrong option.
This finding should worry anyone building AI assistants for advisory work, sales, education, health coaching, travel planning, wealth management, or customer success.
In those domains, users often reveal preferences through stories. “The last consultant gave me a 60-page report and I never used it” is a preference signal. “I tried budgeting apps before, but they made me feel guilty” is a preference signal. “My team hates dashboards that require manual updating” is a preference signal. These are not formal requirements. They are lived constraints.
A model that only responds to explicit preference syntax will look intelligent during demos and clumsy in real relationships. Very modern. Very expensive. Still clumsy.
Multiple choice makes personalization look easier than it is
One of the paper’s best design choices is its comparison across question types: multiple choice, true-or-false, and open-ended generation.
This is not a minor evaluation detail. It changes what the benchmark is actually measuring.
Multiple-choice questions can overestimate personalization because the model may exploit option patterns. If three options are routine gym workouts and one option is spontaneous community dancing, the odd option may stand out even if the model has not properly retrieved the user’s dislike of monotonous exercise. The model appears personalized, but it is partly solving a test-taking problem.
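The shortcut is easy to reproduce. The sketch below is a deliberately naive baseline that never reads the user's history: it picks whichever option looks least like the others, using word-overlap (Jaccard) similarity. The function names and the option wording are illustrative assumptions, not material from the benchmark.

```python
def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens; crude, but enough for the illustration."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Word-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def odd_one_out(options: list[str]) -> int:
    """Return the index of the option least similar to the rest.

    No user history is consulted, so any accuracy this achieves
    comes from option artifacts, not from personalization.
    """
    if len(options) < 2:
        raise ValueError("need at least two options to compare")
    sets = [tokens(o) for o in options]

    def mean_similarity(i: int) -> float:
        others = [jaccard(sets[i], sets[j]) for j in range(len(sets)) if j != i]
        return sum(others) / len(others)

    return min(range(len(options)), key=mean_similarity)


options = [
    "a routine gym workout on the treadmill",
    "a routine gym workout with weights",
    "a routine gym workout on the rowing machine",
    "spontaneous community dancing in the park",
]
choice = odd_one_out(options)  # index 3: the dancing option
```

If a history-blind baseline like this scores well on a multiple-choice set, the set is measuring option design at least as much as preference following.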
True-or-false questions reduce this shortcut by presenting only one candidate option. The model cannot compare four answers and pick the strange one. It must decide whether the option fits the user.
Open-ended generation is harder still. The model gets the query and must produce a useful response that reflects the user’s preference. It has to retrieve the relevant preference, interpret it, and express it in a helpful answer.
The paper uses three dimensions for open-ended evaluation:
| Dimension | What it checks | Why it matters in business use |
|---|---|---|
| Preference awareness | Whether the answer recognizes the user preference | Shows whether the system surfaced the right memory |
| Preference alignment | Whether the recommendation actually fits the preference | Shows whether memory changed the answer |
| Answer quality | Whether the response remains useful and constructive | Prevents “personalized but useless” output |
This is a useful correction to simplistic AI evaluation.
A chatbot can pass a preference test by mentioning the user’s preference while still giving weak advice. Another chatbot can give useful advice that accidentally violates the user’s preference. A real assistant needs both alignment and usefulness. “I remember you hate crowds, so here are three famous overcrowded attractions with crowd-management tips” is not success. It is failure wearing a name tag.
For business teams, this means personalization evaluation should not be a single binary metric. It needs separate checks for memory retrieval, behavioral alignment, and answer usefulness. Otherwise the test will reward systems that sound personal without being personal.
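A minimal sketch of that separation is shown below. It assumes each dimension has already been scored on a 0 to 1 scale by some judge; the scale, the `0.5` threshold, and the label strings are assumptions made for illustration, not the paper's rubric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpenEndedScore:
    """Three separate scores for one answer (assumed 0.0 to 1.0 scale)."""
    awareness: float  # did the answer recognize the preference?
    alignment: float  # does the recommendation actually fit it?
    quality: float    # is the answer still useful and constructive?


def diagnose(score: OpenEndedScore, threshold: float = 0.5) -> str:
    """Name the first failing dimension instead of collapsing to pass/fail."""
    if score.awareness < threshold:
        return "memory_not_surfaced"
    if score.alignment < threshold:
        return "remembered_but_not_applied"
    if score.quality < threshold:
        return "personalized_but_useless"
    return "pass"


# "I remember you hate crowds, so here are three overcrowded attractions..."
name_tag_failure = OpenEndedScore(awareness=0.9, alignment=0.1, quality=0.7)
verdict = diagnose(name_tag_failure)  # "remembered_but_not_applied"
```

Keeping the three scores apart means a regression in one dimension cannot hide behind an improvement in another.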
Long context creates dilution, not just capacity pressure
RealPref’s context configurations are especially important because they separate two mechanisms: total context length, and the distance between the preference signal and the query that needs it.
The benchmark includes short “simple” contexts around 2K tokens, then longer contexts around 37K, 72K, 142K, and an extreme setting around 247K tokens. Random conversations are inserted either within the history or appended near the end to increase the distance between the preference expression and the final query.
The result is intuitive but still worth spelling out: preference following degrades as context grows. More specifically, the paper reports declines in preference awareness and preference alignment as histories become longer. The extended setup also suggests that distance from the preference signal matters, not only total token count.
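Teams can run the same kind of stress test on their own assistant. The sketch below pads a session history with unrelated conversations in two ways and reports how many sessions end up between the preference and the final query. It is an assumption-laden illustration of the idea, not the paper's construction procedure: `add_distractors`, the placement names, and the session strings are all hypothetical.

```python
import random


def add_distractors(history, pref_index, distractors, placement, seed=0):
    """Pad a history with unrelated sessions.

    placement="appended": all distractors go after the existing history,
        pushing the preference as far from the query as possible.
    placement="within": distractors are scattered at random positions.

    Returns (new_history, distance), where distance is the number of
    sessions between the preference session and the end of the history.
    """
    new_history = list(history)
    new_pref_index = pref_index
    if placement == "appended":
        new_history.extend(distractors)
    elif placement == "within":
        rng = random.Random(seed)
        for d in distractors:
            pos = rng.randint(0, len(new_history))
            new_history.insert(pos, d)
            if pos <= new_pref_index:
                # Inserting at or before the preference shifts it right.
                new_pref_index += 1
    else:
        raise ValueError(f"unknown placement: {placement!r}")
    distance = len(new_history) - 1 - new_pref_index
    return new_history, distance


base = ["small talk", "PREF: dislikes crowded attractions", "restaurant question"]
padding = ["spreadsheet help", "weather chat", "recipe request"]
appended_history, appended_distance = add_distractors(base, 1, padding, "appended")
```

In this example the preference starts one session away from the end of the history; appending three distractors moves it four sessions away while the "within" placement yields a distance that depends on where the padding lands. Comparing the two isolates distance from raw length.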
This is the difference between storage and salience.
A long context window means the model can technically receive a large amount of text. It does not guarantee that the right part of the text will dominate the answer at the right time. In a long customer relationship, most prior conversation is irrelevant to the current task. The useful signal is often one sentence, one complaint, one preference, or one repeated pattern buried inside months of interaction.
That is not a context-window problem alone. It is an attention-allocation problem.
For product architecture, the implication is straightforward: dumping raw history into the prompt is a weak personalization strategy. It is expensive, brittle, and hard to audit. The model may have access to everything and still behave as if nothing matters.
A better architecture treats memory as a managed asset:
| Layer | Function | Practical design question |
|---|---|---|
| Raw history | Preserve interaction records | What did the user actually say? |
| Preference extraction | Convert dialogue into candidate preferences | What stable signals can be inferred? |
| Preference store | Keep structured, updateable memory | What should persist across sessions? |
| Retrieval | Surface relevant memories for the current task | Which preferences matter now? |
| Response policy | Apply preferences without overreaching | How should the answer change? |
| Audit trail | Explain why a preference was used | Can the user inspect or correct it? |
This is less glamorous than saying “we use a million-token model.” It is also closer to how a serious product should work.
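As a rough sketch of how those layers fit together, the example below keeps raw history, a structured preference store, topic-based retrieval, and an audit trail in one small class. Everything here is hypothetical: the class and method names are invented for illustration, extraction is done by hand (in practice it would likely be a model call), and topic matching stands in for a real retriever.

```python
from dataclasses import dataclass, field


@dataclass
class Preference:
    statement: str
    topics: set[str]
    source_turns: list[int]  # indices into raw history, kept for auditing


@dataclass
class MemoryStore:
    raw_history: list[str] = field(default_factory=list)
    preferences: list[Preference] = field(default_factory=list)

    def log_turn(self, text: str) -> int:
        """Raw history layer: record what the user actually said."""
        self.raw_history.append(text)
        return len(self.raw_history) - 1

    def add_preference(self, statement: str, topics: set[str], source_turns: list[int]) -> Preference:
        """Preference store layer: persist an extracted, structured preference."""
        pref = Preference(statement, set(topics), list(source_turns))
        self.preferences.append(pref)
        return pref

    def retrieve(self, task_topics: set[str]) -> list[Preference]:
        """Retrieval layer: surface only preferences relevant to the current task."""
        return [p for p in self.preferences if p.topics & task_topics]

    def explain(self, pref: Preference) -> list[str]:
        """Audit layer: show the turns a preference was derived from."""
        return [self.raw_history[i] for i in pref.source_turns]


store = MemoryStore()
turn = store.log_turn("User: The last landmark tour was packed and I hated it.")
store.log_turn("User: Can you convert this recipe to metric?")
store.add_preference("Dislikes crowded tourist attractions", {"travel", "sightseeing"}, [turn])

relevant = store.retrieve({"travel"})
evidence = store.explain(relevant[0])
```

The point of the `source_turns` link is that every preference the assistant applies can be traced back to something the user said, which is what makes inspection and correction possible.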
Reminders help because models are not always proactive
One of the paper’s more practical findings is that simple reminder prompts can help substantially.
The tested improvement methods include:
| Method | Likely purpose in the experiment | What it supports | What it does not prove |
|---|---|---|---|
| Reminder | Test whether explicit instruction to recall preferences improves behavior | Some failures come from lack of proactivity, not total inability | A reminder alone is not a robust memory system |
| Few-shot chain-of-thought | Show examples of preference-aware answers | Demonstrates that response style and reasoning pattern can be induced | Does not guarantee retrieval from very long histories |
| Retrieval-augmented generation | Provide the top relevant history turns | Tests whether external retrieval improves preference following | Retrieval quality itself becomes a system dependency |
In contexts within the model’s retrieval capability, Reminder, Few-shot CoT, and RAG show similar improvement effects for stronger models. The simplest intervention can compete with more elaborate prompting. That sounds almost too cheap, which is why it is worth interpreting carefully.
A reminder does not create memory. It changes the model’s behavior by making preference recall part of the immediate task. The model may already have enough information in context, but without a reminder it does not actively use it. In other words, the model is not only limited by what it knows. It is limited by what it decides to attend to.
That is a product lesson hiding inside an evaluation result.
Many assistant failures are not caused by missing data. The user preference is somewhere in the transcript. The problem is that the system never converted it into a live constraint for generation. A reminder prompt can partially patch that failure, but it is not enough when histories become too long or preference signals become too buried.
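Mechanically, a reminder is a very small change to the prompt. The sketch below shows the shape of it under stated assumptions: the `REMINDER` wording is invented for illustration and is not the prompt used in the paper, and `build_prompt` is a hypothetical helper that simply joins turns into one string.

```python
# Hypothetical reminder text; the paper's exact wording is not reproduced here.
REMINDER = (
    "Before answering, recall any preferences the user expressed earlier "
    "in this history and make sure the answer respects them."
)


def build_prompt(history: list[str], query: str, reminder: bool = False) -> str:
    """Assemble a prompt from prior turns, an optional reminder, and the query."""
    parts = list(history)
    if reminder:
        # The reminder adds no new information about the user. It only makes
        # preference recall part of the immediate task.
        parts.append(REMINDER)
    parts.append(f"User: {query}")
    return "\n".join(parts)


history = ["User: I dislike crowded tourist attractions.", "Assistant: Noted."]
plain = build_prompt(history, "Plan my weekend in the city.")
reminded = build_prompt(history, "Plan my weekend in the city.", reminder=True)
```

The two prompts contain the same user history; the only difference is whether recalling preferences has been made an explicit step.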
In the extreme context setting, RAG becomes more valuable. That is also predictable: when the full history exceeds the model’s effective retrieval ability, external retrieval can move the relevant memory from the attic to the desk.
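A toy version of that retrieval step is sketched below. It ranks history turns against the query with bag-of-words cosine similarity and keeps the top `k`. This is a stand-in chosen to keep the example self-contained: a production system would more plausibly use learned embeddings, and the function names and sample turns are assumptions.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Bag-of-words vector from lowercased whitespace tokens."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_turns(history: list[str], query: str, k: int = 2) -> list[str]:
    """Return the k history turns most similar to the query."""
    q = bow(query)
    ranked = sorted(history, key=lambda turn: cosine(bow(turn), q), reverse=True)
    return ranked[:k]


history = [
    "I really dislike crowded tourist attractions",
    "What is the weather like in March",
    "Can you help me fix this spreadsheet formula",
]
best = top_k_turns(history, "Suggest tourist attractions for a weekend", k=1)
```

Note the dependency this creates: lexical matching finds the turn here only because the query happens to share words with it. An implicit preference phrased differently would be missed, which is why retrieval quality becomes its own failure point.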
The business inference is not “always use RAG.” The better inference is conditional:
| Situation | Likely best intervention |
|---|---|
| Preference is explicit and recent | Reminder may be enough |
| Preference is explicit but buried in long history | Retrieval is likely needed |
| Preference is implicit across sessions | Extraction and summarization become critical |
| Preference must generalize to a new domain | Reasoning policy and guardrails matter |
| Preference may be sensitive or outdated | User control and verification are required |
A mature personalization system should not choose between prompt reminders and retrieval as if they are rival religions. It should use both, plus structured memory, plus update logic. Architecture is usually boring until it saves the product.
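The conditional table can be written down as a small routing function. This is a sketch of the article's heuristic, not a tested policy: the field names and the intervention labels are hypothetical, and a real system would derive these flags from its memory layer rather than set them by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferenceContext:
    """Hypothetical flags describing one stored preference at query time."""
    explicit: bool            # stated outright rather than implied
    recent: bool              # close to the current query in the history
    cross_domain: bool        # must be generalized to a new domain
    sensitive_or_stale: bool  # may be sensitive or outdated


def choose_interventions(ctx: PreferenceContext) -> list[str]:
    """Combine interventions instead of picking exactly one."""
    steps = []
    if ctx.explicit:
        # Recent explicit preferences may only need a nudge;
        # buried ones likely need retrieval.
        steps.append("reminder" if ctx.recent else "retrieval")
    else:
        steps.append("extraction_and_summarization")
    if ctx.cross_domain:
        steps.append("reasoning_policy_and_guardrails")
    if ctx.sensitive_or_stale:
        steps.append("user_verification")
    return steps


buried = choose_interventions(PreferenceContext(True, False, False, False))
# ["retrieval"]
```

Returning a list rather than a single label reflects the point above: the interventions stack.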
Generalization is the difference between memory and understanding
RealPref’s generalized preference tests are important because they move beyond “remember what the user said.”
The model sees original preference expressions in the context. Then it must answer queries involving generalized preferences that were not directly stated but can be inferred from similar reasons. The paper reports that models perform worse on generalized preference queries than on original preference queries in the same zero-shot setting. Reminder improves performance, but generalized preference following remains a harder problem, especially for stronger models where generalized preference scores still trail original preference scores.
This result should shape how businesses define personalization.
A weak assistant can repeat stored preferences.
A better assistant can apply stored preferences.
A genuinely useful assistant can infer adjacent preferences carefully, explain the inference, and let the user correct it.
That last part matters. Generalization is where personalization becomes valuable—and risky. If a system only repeats known facts, it becomes a searchable notebook. If it generalizes too freely, it becomes a stereotype machine with a friendly interface.
For enterprise systems, preference generalization should therefore be treated as a policy-controlled inference layer, not a casual byproduct of model creativity.
A reasonable assistant might say:
“Since you usually prefer quieter, locally run places over crowded tourist spots, I’d start with smaller neighborhood restaurants. I may be overgeneralizing, so tell me if this trip is different.”
That response does three things: it uses memory, applies it to a new case, and keeps the inference corrigible. This is not just good manners. It is risk control.
In regulated or sensitive domains, the stakes rise. A financial assistant should not infer risk tolerance from a few casual remarks and quietly adjust portfolio advice. A health assistant should not infer medical preferences from emotional tone alone. A workplace assistant should not infer employee personality traits and route tasks accordingly. Personalization without boundaries becomes profiling. Profiling with a chatbot smile is still profiling.
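What a policy-controlled inference layer might look like is sketched below. It gates each inferred preference before the assistant is allowed to act on it. The domain list mirrors the three examples in the paragraph above; the `min_signals` threshold, the action labels, and all names are assumptions for illustration rather than a recommended standard.

```python
from dataclasses import dataclass

# Domains where an inferred preference should not be applied silently.
# Illustrative only; a real list would come from legal and policy review.
SENSITIVE_DOMAINS = frozenset({"finance", "health", "employment"})


@dataclass(frozen=True)
class Inference:
    statement: str           # the generalized preference being proposed
    domain: str              # where it would be applied
    supporting_signals: int  # how many distinct user signals back it


def generalization_policy(inference: Inference, min_signals: int = 2) -> str:
    """Decide how an inferred preference may be used."""
    if inference.domain in SENSITIVE_DOMAINS:
        # Never act on inference alone in sensitive domains.
        return "ask_user"
    if inference.supporting_signals < min_signals:
        # Too little evidence: do not generalize from a single remark.
        return "skip"
    # Apply, but say so and invite correction.
    return "apply_with_caveat"


dining = Inference("Prefers quieter, locally run restaurants", "travel", 3)
risk = Inference("Has low risk tolerance", "finance", 3)
```

Here `generalization_policy(dining)` permits a hedged recommendation like the one quoted earlier, while `generalization_policy(risk)` routes to an explicit question no matter how much conversational evidence exists.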
What RealPref directly shows, and what Cognaptus infers
It is useful to separate the paper’s direct evidence from the business interpretation.
| Claim | Evidence from RealPref | Business meaning | Boundary |
|---|---|---|---|
| MCQ can overstate preference-following ability | MCQ scores are high and less discriminative; true-or-false and open-ended tasks are more revealing | Do not evaluate personalization with easy recognition tasks only | The paper’s exact option artifacts are benchmark-specific |
| Implicit preferences are harder | Scores decline from explicit statements to implicit stylistic and experience-feedback expressions | Assistants need preference extraction beyond direct user declarations | Synthetic dialogues may not cover all real conversational styles |
| Long context degrades preference use | Performance drops as context grows and as useful signals are placed farther from the query | Large context windows do not replace memory architecture | Figure results are directional; deployment behavior depends on model and system design |
| Reminder helps | Reminder prompts improve performance in several settings | Some failures are due to lack of proactivity | Prompting is a patch, not governance |
| RAG helps in longer contexts | Retrieval improves performance when context exceeds effective retrieval capability | External memory and retrieval are operationally important | Retrieval quality, indexing, privacy, and recency remain unresolved |
| Generalization remains difficult | Models perform worse on generalized preferences than original ones | Personalization requires cautious inference, not just stored memory | Over-generalization can create user trust and compliance risks |
The paper directly shows benchmark behavior under controlled synthetic conditions. Cognaptus’s business inference is that personalization should be engineered as a lifecycle: detect, store, retrieve, apply, generalize, and audit.
That is a heavier product requirement than “add memory.”
It also explains why many AI assistant demos feel better than the deployed product. Demos use short, clean, recent signals. Real users create messy, old, contradictory, evolving signals. A benchmark that stresses long-horizon preference following is therefore closer to the product problem than a benchmark that asks the model to obey one fresh instruction at the bottom of the prompt.
The boundaries: synthetic users, judge models, and changing preferences
RealPref is useful, but it is not the final word on personalization.
First, the dataset is synthetic. Synthetic data enables control, scale, and privacy protection. It also risks smoothing out the messiness of human behavior. Real people contradict themselves, change their minds, express preferences differently across cultures, and sometimes say things they do not actually want the assistant to remember.
Second, open-ended evaluation relies on an LLM judge with predefined rubrics. The rubrics are more granular than a binary score, which is a strength. Still, LLM judging is not identical to human judgment, especially when the question involves taste, social nuance, privacy expectations, or whether a response feels uncomfortably personal.
Third, the benchmark focuses on preferences actively shared through dialogue. That is a sensible boundary. But deployed systems often combine chat history with clicks, purchases, location, device behavior, uploaded files, CRM records, and third-party data. Once those signals enter the system, the personalization problem becomes not only technical but contractual: what did the user consent to, what can they inspect, and what can they delete?
Fourth, preferences in RealPref are not primarily dynamic. In reality, preference memory needs versioning. A user may dislike hostels at age 25 and prefer boutique hotels at 35. A sales lead may initially care about price, then shift toward reliability after a failed implementation. A customer may reject automation until staffing pressure changes the economics. Personalization systems need decay, conflict resolution, and explicit update mechanisms.
A memory that never changes is not personalization. It is fossilization.
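One simple way to avoid fossilization is to store every observation with a timestamp and let older evidence lose weight. The sketch below uses exponential decay with a half-life and resolves conflicts by total decayed weight. The decay form, the one-year half-life, and all names are assumptions chosen for illustration; RealPref does not prescribe an update mechanism.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PreferenceVersion:
    topic: str
    value: str
    observed_day: int  # days since an arbitrary start date


def weight(version: PreferenceVersion, today: int, half_life_days: float = 365.0) -> float:
    """Evidence loses half its weight every half_life_days."""
    age = today - version.observed_day
    return 0.5 ** (age / half_life_days)


def current_preference(
    versions: list[PreferenceVersion],
    topic: str,
    today: int,
    half_life_days: float = 365.0,
) -> Optional[str]:
    """Resolve conflicting versions by summing decayed weights per value."""
    totals: dict[str, float] = {}
    for v in versions:
        if v.topic == topic and v.observed_day <= today:
            totals[v.value] = totals.get(v.value, 0.0) + weight(v, today, half_life_days)
    if not totals:
        return None
    return max(totals, key=totals.get)


lodging = [
    PreferenceVersion("lodging", "hostels", 0),
    PreferenceVersion("lodging", "hostels", 30),
    PreferenceVersion("lodging", "boutique hotels", 3650),  # about ten years later
]
now = current_preference(lodging, "lodging", today=3650)
then = current_preference(lodging, "lodging", today=60)
```

With these inputs, two old mentions of hostels are outweighed by one recent mention of boutique hotels at day 3650, while a query at day 60 still resolves to hostels. Decay handles gradual drift; an explicit user correction should still override it.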
The business value is diagnosis, not a magical assistant
The most useful contribution of RealPref is diagnostic.
It gives teams a way to ask: where is our assistant failing?
Is it failing because it cannot identify implicit preferences? Because it loses relevant information in long context? Because it retrieves the wrong memory? Because it remembers but does not apply? Because it applies too literally? Because it cannot generalize? Because it generalizes without permission?
Those are different product failures. They require different fixes.
A CRM copilot that forgets a client prefers concise proposals needs a retrieval and memory layer. A travel assistant that registers “quiet places” but recommends isolated locations that are unsafe for the user has an application-quality problem. A financial assistant that infers risk appetite from casual language has a governance problem. A tutoring assistant that remembers a student dislikes long explanations but never adapts lesson structure has an alignment-to-output problem.
RealPref’s mechanism-first lesson is therefore simple:
Personalization is not a feature. It is a pipeline.
And every stage of that pipeline can break.
The next generation of AI assistants will not become personal merely because models get larger or context windows get longer. They will become personal when systems learn how to transform user history into governed, relevant, updateable, and inspectable constraints on future responses.
Until then, the assistant may remember your words.
It may still forget what they mean.
[1] Qianyun Guo, Yibo Li, Yue Liu, and Bryan Hooi, “Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions,” arXiv:2603.04191.