Memory Isn’t Personal: Why LLMs Still Forget What You Like

A customer tells your AI assistant that she dislikes crowded tourist attractions. Three weeks later, she asks for a weekend itinerary.

A good assistant should not proudly recommend the busiest landmark in the city.

A less good assistant will do exactly that, but in a warm tone.

This is the quiet failure mode behind many “personal AI” demos. The interface remembers the conversation. The product claims continuity. The model may even have a giant context window large enough to swallow a small novel. Yet when the user asks a new question, the system behaves as if the earlier preference is just decorative text floating somewhere in the attic.

The paper behind RealPref, Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions, gives this problem a more disciplined test.¹ Its main contribution is not simply another benchmark. The useful part is sharper: it separates “the model saw the preference” from “the model can detect, retain, retrieve, apply, and generalize that preference when it matters.”

That distinction is where many assistant products quietly lose the plot.

The easy misconception: put more history in the prompt

The most tempting diagnosis is also the laziest one: LLMs forget because we do not give them enough memory.

So the obvious product fix becomes: store more conversation history, extend the context window, maybe attach a vector database, and call the result personalization. Investors nod. Users get a settings page. Everyone pretends memory has been solved.

RealPref makes that view look incomplete.

The benchmark is built around long-horizon user–LLM interaction histories. It contains 100 synthetic user profiles, 1,300 preference-query pairs, four types of preference expression, and contexts ranging from short histories around 2K tokens to extreme histories around 247K tokens. The important design choice is that preferences are not always handed to the model as neat instructions. They may appear directly, indirectly, stylistically, or through experience feedback spread across sessions.

That matters because real users rarely behave like benchmark designers.

They do not always say, “I prefer boutique hotels over chain hotels because I value local authenticity.” Sometimes they say they hated their last generic hotel, enjoyed talking with a family-run host, found the neighborhood more memorable than the amenities, and later ask where to stay in Japan. A human travel adviser can connect the pattern. A model may see all the words and still miss the preference.

So the RealPref question is not: can the model read a long transcript?

It is: can the model convert scattered user history into useful preference-aware behavior at the right moment?

Those are very different capabilities. The first is context ingestion. The second is personalization.

RealPref tests the preference lifecycle, not just recall

The benchmark’s structure is useful because it turns personalization into a sequence of operational failures.

A personalized assistant must do at least five things:

Capability	What the assistant must do	Typical failure
Detection	Notice that a preference exists	Treat preference signals as casual conversation
Interpretation	Understand what the preference means	Over-literal reading or missing the reason behind the preference
Retrieval	Bring the relevant signal back when needed	Lose the preference inside long irrelevant context
Application	Shape the answer around the preference	Give a generic answer with no adaptation
Generalization	Extend the preference carefully to related cases	Either fail to infer or infer too aggressively

RealPref’s dataset design maps onto this lifecycle.

Each synthetic user has a profile, demographic information, a biography, life events, and preference-related conversations. The original preferences are generated to be diverse, persona-related, unique, value-free, and complete. That “value-free” requirement is important. If every preference is socially obvious—healthy food, clean hotels, responsible spending—the model can pass by guessing what a sensible person might want. Real personalization starts where common sense is no longer enough.

The paper also distinguishes original preferences from generalized preferences. Original preferences are expressed in the dialogue. Generalized preferences are not directly stated but can be rationally inferred from the reasons behind earlier preferences. For example, someone who prefers unconventional food, boutique accommodations, and experimental makeup may also prefer independent cinema over formulaic blockbusters. That is not a memorization test. It is a controlled test of preference reasoning.

A production assistant faces this constantly. A user who dislikes crowded travel spots may also prefer quieter restaurants, off-peak scheduling, and small-group tours. But the assistant must infer this carefully. Too little generalization makes the system forgetful. Too much generalization makes it creepy. The line between “thoughtful” and “why are you watching me?” is thinner than product copy usually admits.

Four preference expressions expose where models stop being attentive

RealPref uses four expression types, moving from easy to difficult:

Preference expression type	What it looks like	Main capability tested
Explicit direct statement	“I prefer X over Y.”	Basic extraction
Explicit contextualized mention	Preference appears naturally inside a broader conversation	Extraction under distraction
Implicit stylistic expression	Preference is implied through contrast, emotion, or rhetorical cues	Pragmatic interpretation
Implicit experience feedback	Preference emerges across multiple sessions and experiences	Temporal integration

This is the mechanism-first heart of the paper.

Direct statements are close to ordinary instruction following. A model sees the preference, stores it temporarily in its attention pattern, and can often use it. Contextualized mention is harder because the preference is surrounded by other material. Stylistic expression is harder again because the model must infer attitude without a clean “I prefer” marker. Experience feedback is the most realistic and the most painful: the preference is distributed across time.

The paper reports a clear decline from explicit to implicit expression types across the open-ended evaluation dimensions. The decline is especially meaningful because open-ended answers require the model to proactively produce a preference-aligned response, not merely pick the least-wrong option.

This finding should worry anyone building AI assistants for advisory work, sales, education, health coaching, travel planning, wealth management, or customer success.

In those domains, users often reveal preferences through stories. “The last consultant gave me a 60-page report and I never used it” is a preference signal. “I tried budgeting apps before, but they made me feel guilty” is a preference signal. “My team hates dashboards that require manual updating” is a preference signal. These are not formal requirements. They are lived constraints.

A model that only responds to explicit preference syntax will look intelligent during demos and clumsy in real relationships. Very modern. Very expensive. Still clumsy.

Multiple choice makes personalization look easier than it is

One of the paper’s best design choices is its comparison across question types: multiple choice, true-or-false, and open-ended generation.

This is not a minor evaluation detail. It changes what the benchmark is actually measuring.

Multiple-choice questions can overestimate personalization because the model may exploit option patterns. If three options are routine gym workouts and one option is spontaneous community dancing, the odd option may stand out even if the model has not properly retrieved the user’s dislike of monotonous exercise. The model appears personalized, but it is partly solving a test-taking problem.

True-or-false questions reduce this shortcut by presenting only one candidate option. The model cannot compare four answers and pick the strange one. It must decide whether the option fits the user.

Open-ended generation is harder still. The model gets the query and must produce a useful response that reflects the user’s preference. It has to retrieve the relevant preference, interpret it, and express it in a helpful answer.

The paper uses three dimensions for open-ended evaluation:

Dimension	What it checks	Why it matters in business use
Preference awareness	Whether the answer recognizes the user preference	Shows whether the system surfaced the right memory
Preference alignment	Whether the recommendation actually fits the preference	Shows whether memory changed the answer
Answer quality	Whether the response remains useful and constructive	Prevents “personalized but useless” output

This is a useful correction to simplistic AI evaluation.

A chatbot can pass a preference test by mentioning the user’s preference while still giving weak advice. Another chatbot can give useful advice that accidentally violates the user’s preference. A real assistant needs both alignment and usefulness. “I remember you hate crowds, so here are three famous overcrowded attractions with crowd-management tips” is not success. It is failure wearing a name tag.

For business teams, this means personalization evaluation should not be a single binary metric. It needs separate checks for memory retrieval, behavioral alignment, and answer usefulness. Otherwise the test will reward systems that sound personal without being personal.

Long context creates dilution, not just capacity pressure

RealPref’s context configurations are especially important because they separate two mechanisms: total context length and distance from the useful preference signal.

The benchmark includes short “simple” contexts around 2K tokens, then longer contexts around 37K, 72K, 142K, and an extreme setting around 247K tokens. Random conversations are inserted either within the history or appended near the end to increase the distance between the preference expression and the final query.

The result is intuitive but still worth spelling out: preference following degrades as context grows. More specifically, the paper reports declines in preference awareness and preference alignment as histories become longer. The extended setup also suggests that distance from the preference signal matters, not only total token count.

This is the difference between storage and salience.

A long context window means the model can technically receive a large amount of text. It does not guarantee that the right part of the text will dominate the answer at the right time. In a long customer relationship, most prior conversation is irrelevant to the current task. The useful signal is often one sentence, one complaint, one preference, or one repeated pattern buried inside months of interaction.

That is not a context-window problem alone. It is an attention-allocation problem.

For product architecture, the implication is straightforward: dumping raw history into the prompt is a weak personalization strategy. It is expensive, brittle, and hard to audit. The model may have access to everything and still behave as if nothing matters.

A better architecture treats memory as a managed asset:

Layer	Function	Practical design question
Raw history	Preserve interaction records	What did the user actually say?
Preference extraction	Convert dialogue into candidate preferences	What stable signals can be inferred?
Preference store	Keep structured, updateable memory	What should persist across sessions?
Retrieval	Surface relevant memories for the current task	Which preferences matter now?
Response policy	Apply preferences without overreaching	How should the answer change?
Audit trail	Explain why a preference was used	Can the user inspect or correct it?

This is less glamorous than saying “we use a million-token model.” It is also closer to how a serious product should work.

Reminders help because models are not always proactive

One of the paper’s more practical findings is that simple reminder prompts can help substantially.

The tested improvement methods include:

Method	Likely purpose in the experiment	What it supports	What it does not prove
Reminder	Test whether explicit instruction to recall preferences improves behavior	Some failures come from lack of proactivity, not total inability	A reminder alone is not a robust memory system
Few-shot chain-of-thought	Show examples of preference-aware answers	Demonstrates that response style and reasoning pattern can be induced	Does not guarantee retrieval from very long histories
Retrieval-augmented generation	Provide the top relevant history turns	Tests whether external retrieval improves preference following	Retrieval quality itself becomes a system dependency

In contexts within the model’s retrieval capability, Reminder, Few-shot CoT, and RAG show similar improvement effects for stronger models. The simplest intervention can compete with more elaborate prompting. That sounds almost too cheap, which is why it is worth interpreting carefully.

A reminder does not create memory. It changes the model’s behavior by making preference recall part of the immediate task. The model may already have enough information in context, but without a reminder it does not actively use it. In other words, the model is not only limited by what it knows. It is limited by what it decides to attend to.

That is a product lesson hiding inside an evaluation result.

Many assistant failures are not caused by missing data. The user preference is somewhere in the transcript. The problem is that the system never converted it into a live constraint for generation. A reminder prompt can partially patch that failure, but it is not enough when histories become too long or preference signals become too buried.

In the extreme context setting, RAG becomes more valuable. That is also predictable: when the full history exceeds the model’s effective retrieval ability, external retrieval can move the relevant memory from the attic to the desk.

The business inference is not “always use RAG.” The better inference is conditional:

Situation	Likely best intervention
Preference is explicit and recent	Reminder may be enough
Preference is explicit but buried in long history	Retrieval is likely needed
Preference is implicit across sessions	Extraction and summarization become critical
Preference must generalize to a new domain	Reasoning policy and guardrails matter
Preference may be sensitive or outdated	User control and verification are required

A mature personalization system should not choose between prompt reminders and retrieval as if they are rival religions. It should use both, plus structured memory, plus update logic. Architecture is usually boring until it saves the product.

Generalization is the difference between memory and understanding

RealPref’s generalized preference tests are important because they move beyond “remember what the user said.”

The model sees original preference expressions in the context. Then it must answer queries involving generalized preferences that were not directly stated but can be inferred from similar reasons. The paper reports that models perform worse on generalized preference queries than on original preference queries in the same zero-shot setting. Reminder improves performance, but generalized preference following remains a harder problem, especially for stronger models where generalized preference scores still trail original preference scores.

This result should shape how businesses define personalization.

A weak assistant can repeat stored preferences.

A better assistant can apply stored preferences.

A genuinely useful assistant can infer adjacent preferences carefully, explain the inference, and let the user correct it.

That last part matters. Generalization is where personalization becomes valuable—and risky. If a system only repeats known facts, it becomes a searchable notebook. If it generalizes too freely, it becomes a stereotype machine with a friendly interface.

For enterprise systems, preference generalization should therefore be treated as a policy-controlled inference layer, not a casual byproduct of model creativity.

A reasonable assistant might say:

“Since you usually prefer quieter, locally run places over crowded tourist spots, I’d start with smaller neighborhood restaurants. I may be overgeneralizing, so tell me if this trip is different.”

That response does three things: it uses memory, applies it to a new case, and keeps the inference corrigible. This is not just good manners. It is risk control.

In regulated or sensitive domains, the stakes rise. A financial assistant should not infer risk tolerance from a few casual remarks and quietly adjust portfolio advice. A health assistant should not infer medical preferences from emotional tone alone. A workplace assistant should not infer employee personality traits and route tasks accordingly. Personalization without boundaries becomes profiling. Profiling with a chatbot smile is still profiling.

What RealPref directly shows, and what Cognaptus infers

It is useful to separate the paper’s direct evidence from the business interpretation.

Claim	Evidence from RealPref	Business meaning	Boundary
MCQ can overstate preference-following ability	MCQ scores are high and less discriminative; true-or-false and open-ended tasks are more revealing	Do not evaluate personalization with easy recognition tasks only	The paper’s exact option artifacts are benchmark-specific
Implicit preferences are harder	Scores decline from explicit statements to implicit stylistic and experience-feedback expressions	Assistants need preference extraction beyond direct user declarations	Synthetic dialogues may not cover all real conversational styles
Long context degrades preference use	Performance drops as context grows and as useful signals are placed farther from the query	Large context windows do not replace memory architecture	Figure results are directional; deployment behavior depends on model and system design
Reminder helps	Reminder prompts improve performance in several settings	Some failures are due to lack of proactivity	Prompting is a patch, not governance
RAG helps in longer contexts	Retrieval improves performance when context exceeds effective retrieval capability	External memory and retrieval are operationally important	Retrieval quality, indexing, privacy, and recency remain unresolved
Generalization remains difficult	Models perform worse on generalized preferences than original ones	Personalization requires cautious inference, not just stored memory	Over-generalization can create user trust and compliance risks

The paper directly shows benchmark behavior under controlled synthetic conditions. Cognaptus’s business inference is that personalization should be engineered as a lifecycle: detect, store, retrieve, apply, generalize, and audit.

That is a heavier product requirement than “add memory.”

It also explains why many AI assistant demos feel better than the deployed product. Demos use short, clean, recent signals. Real users create messy, old, contradictory, evolving signals. A benchmark that stresses long-horizon preference following is therefore closer to the product problem than a benchmark that asks the model to obey one fresh instruction at the bottom of the prompt.

The boundaries: synthetic users, judge models, and changing preferences

RealPref is useful, but it is not the final word on personalization.

First, the dataset is synthetic. Synthetic data enables control, scale, and privacy protection. It also risks smoothing out the messiness of human behavior. Real people contradict themselves, change their minds, express preferences differently across cultures, and sometimes say things they do not actually want the assistant to remember.

Second, open-ended evaluation relies on an LLM judge with predefined rubrics. The rubrics are more granular than a binary score, which is a strength. Still, LLM judging is not identical to human judgment, especially when the question involves taste, social nuance, privacy expectations, or whether a response feels uncomfortably personal.

Third, the benchmark focuses on preferences actively shared through dialogue. That is a sensible boundary. But deployed systems often combine chat history with clicks, purchases, location, device behavior, uploaded files, CRM records, and third-party data. Once those signals enter the system, the personalization problem becomes not only technical but contractual: what did the user consent to, what can they inspect, and what can they delete?

Fourth, preferences in RealPref are not primarily dynamic. In reality, preference memory needs versioning. A user may dislike hostels at age 25 and prefer boutique hotels at 35. A sales lead may initially care about price, then shift toward reliability after a failed implementation. A customer may reject automation until staffing pressure changes the economics. Personalization systems need decay, conflict resolution, and explicit update mechanisms.

A memory that never changes is not personalization. It is fossilization.

The business value is diagnosis, not a magical assistant

The most useful contribution of RealPref is diagnostic.

It gives teams a way to ask: where is our assistant failing?

Is it failing because it cannot identify implicit preferences? Because it loses relevant information in long context? Because it retrieves the wrong memory? Because it remembers but does not apply? Because it applies too literally? Because it cannot generalize? Because it generalizes without permission?

Those are different product failures. They require different fixes.

A CRM copilot that forgets a client prefers concise proposals needs a retrieval and memory layer. A travel assistant that notices “quiet places” but recommends isolated locations unsafe for the user has an application-quality problem. A financial assistant that infers risk appetite from casual language has a governance problem. A tutoring assistant that remembers a student dislikes long explanations but never adapts lesson structure has an alignment-to-output problem.

RealPref’s mechanism-first lesson is therefore simple:

Personalization is not a feature. It is a pipeline.

And every stage of that pipeline can break.

The next generation of AI assistants will not become personal merely because models get larger or context windows get longer. They will become personal when systems learn how to transform user history into governed, relevant, updateable, and inspectable constraints on future responses.

Until then, the assistant may remember your words.

It may still forget what they mean.

Cognaptus: Automate the Present, Incubate the Future.

Qianyun Guo, Yibo Li, Yue Liu, and Bryan Hooi, “Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions,” arXiv:2603.04191. ↩︎

The easy misconception: put more history in the prompt#

RealPref tests the preference lifecycle, not just recall#

Four preference expressions expose where models stop being attentive#

Multiple choice makes personalization look easier than it is#

Long context creates dilution, not just capacity pressure#

Reminders help because models are not always proactive#

Generalization is the difference between memory and understanding#

What RealPref directly shows, and what Cognaptus infers#

The boundaries: synthetic users, judge models, and changing preferences#

The business value is diagnosis, not a magical assistant#