Opening — Why this matters now

AI assistants are rapidly moving from tools to companions. People now ask language models not only for facts, but for advice tailored to their habits, tastes, and goals.

If a user tells an assistant they dislike crowded tourist attractions, the assistant should remember that the next time travel planning comes up. If someone prefers indie films over blockbusters, recommendations should evolve accordingly.

Simple enough in theory.

In practice, today’s large language models still behave like polite strangers with short memories.

A recent research effort introduces RealPref, a benchmark designed to test whether LLMs can actually follow user preferences across long conversations. The results are revealing—and slightly uncomfortable for anyone expecting AI assistants to behave like attentive personal aides.

The short version: models are good at following instructions in a prompt. They are far less reliable at remembering who you are.


Background — Personalization Is the Real Product

Most benchmarks in the LLM world test skills such as reasoning, question answering, or factual retrieval. Those capabilities matter, but they only capture part of what makes an assistant genuinely useful.

Real assistants—human or digital—need contextual continuity.

That means three abilities:

  1. Detect preferences from conversations.
  2. Remember them across time.
  3. Apply them in new situations.

Earlier evaluation datasets largely simplified this problem:

| Limitation | Typical Benchmark Behavior | Real-World Behavior |
|---|---|---|
| Short context | Few dialogue turns | Months of interaction |
| Explicit signals | “I prefer X” statements | Implicit cues and experiences |
| Simple evaluation | Binary correctness | Nuanced judgment |

In other words, many benchmarks measure instruction following, not personalization.

RealPref attempts to close that gap.


Analysis — The RealPref Benchmark

The benchmark simulates long-term user–assistant relationships rather than isolated prompts.

Dataset Structure

The dataset includes:

| Component | Quantity |
|---|---|
| User profiles | 100 |
| Preferences | 1,300 |
| Conversation sessions per user | 10–15 |
| Preference expression types | 4 |

Each simulated user includes:

  • Demographics and persona
  • Life events
  • A set of preferences
  • Conversations revealing those preferences over time

These signals appear in multiple ways:

| Expression Type | Example |
|---|---|
| Explicit statement | “I prefer herbal tea over soda.” |
| Contextual mention | Preference appears during another discussion |
| Stylistic implication | Emotional or rhetorical hints |
| Experience feedback | Preferences inferred from stories over time |
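For illustration, a single simulated user record might be shaped roughly like this. The field names are assumptions inferred from the description above, not RealPref's actual schema.

```python
# Illustrative shape of one simulated user; field names are assumptions,
# not RealPref's actual schema.

user = {
    "profile": {"age": 34, "occupation": "teacher", "persona": "budget-conscious traveler"},
    "life_events": ["spent a year working in Kyoto"],
    "preferences": [
        {"id": "p1", "text": "prefers culturally authentic experiences"},
        {"id": "p2", "text": "dislikes crowded tourist attractions"},
    ],
    "sessions": [
        {   # each session reveals preferences through one of four expression types
            "turns": ["I always seek out culturally authentic places when I travel."],
            "expression_type": "explicit_statement",
            "reveals": ["p1"],
        },
        {
            "turns": ["Last trip, the packed viewpoint completely ruined the sunset for me."],
            "expression_type": "experience_feedback",
            "reveals": ["p2"],
        },
    ],
}

# The evaluation then asks whether a model, given all sessions,
# answers in a way consistent with the revealed preferences.
expression_types = {s["expression_type"] for s in user["sessions"]}
print(sorted(expression_types))
```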

The final challenge for the model is simple: answer a question while respecting those preferences.

For instance:

A user planning a trip might ask where to stay in Japan. If earlier conversations indicate they prefer culturally authentic experiences, the assistant should recommend a traditional ryokan, not a modern business hotel.

This sounds obvious.

But the benchmark reveals a deeper problem: context length and subtle signals break the illusion of personalization.


Findings — Where Models Start Forgetting

1. Multiple Choice Overestimates Personalization

When models answer multiple-choice questions, they often perform extremely well.

Unfortunately, this success is misleading.

| Evaluation Type | What Happens |
|---|---|
| Multiple choice | Models exploit “odd option” patterns |
| True/False | Harder to guess, lower scores |
| Open-ended generation | Best reflection of real ability |

When forced to generate responses instead of selecting answers, model differences become far more visible.

In other words: recognition is easier than reasoning.


2. Implicit Preferences Are a Major Weak Point

Performance drops sharply as preferences become less explicit.

| Preference Expression | Difficulty | Model Performance |
|---|---|---|
| Direct statements | Easy | High |
| Contextual mention | Moderate | Declines |
| Stylistic cues | Hard | Significant drop |
| Experience feedback | Hardest | Largest drop |

Humans naturally infer preferences from stories, tone, and patterns. LLMs struggle once signals stop looking like structured instructions.


3. Long Context Is Still a Bottleneck

Increasing conversation length steadily degrades performance.

| Context Length | Preference Awareness | Preference Alignment |
|---|---|---|
| ~2K tokens | Strong | Strong |
| ~37K tokens | Moderate | Declining |
| ~72K tokens | Noticeable degradation | Noticeable degradation |
| ~142K tokens | Significant drop | Significant drop |

Even when the model technically supports long contexts, important signals get diluted inside irrelevant dialogue.

This is not a memory limit in the strict sense—it is closer to a retrieval problem inside the model’s reasoning process.


4. Simple Reminders Help Surprisingly Much

The study tested several improvement methods.

| Method | Effect |
|---|---|
| Reminder prompt | Strong improvement |
| Few-shot examples | Similar improvement |
| Retrieval-augmented generation (RAG) | Best in very long contexts |

Interestingly, the simplest solution—a reminder to recall preferences—often performs nearly as well as more complex techniques.

Sometimes the model simply needs to be asked to pay attention.
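The reminder intervention can be as simple as prepending one instruction before the final question. This is a hypothetical sketch; the study's exact reminder wording is not reproduced here.

```python
def with_reminder(history: str, question: str) -> str:
    """Prepend a reminder nudging the model to consult stated preferences.

    The wording is illustrative; the benchmark's actual prompt may differ.
    """
    reminder = (
        "Before answering, recall any preferences the user has expressed "
        "earlier in this conversation and make sure your answer respects them."
    )
    return f"{history}\n\n{reminder}\n\nUser question: {question}"

prompt = with_reminder(
    history="User: I dislike crowded tourist attractions.",
    question="Where should I go in Kyoto?",
)
print(prompt)
```

The appeal of this method is that it requires no extra infrastructure: the preference signal is already in context, and the reminder simply redirects attention to it.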


5. Generalization Remains Limited

Even when models understand explicit preferences, they rarely extend them to related scenarios.

Example:

  • User prefers indie films.
  • Model should infer preference for experimental cinema.

Current models rarely make that leap reliably.

This exposes a gap between memory and preference reasoning.
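One way to frame the missing step: generalization is reasoning over a taxonomy of related categories rather than matching the literal preference string. The category map below is invented purely for illustration.

```python
# Hypothetical taxonomy linking stated preferences to related categories.
# The entries are invented for illustration.
RELATED = {
    "indie films": {"experimental cinema", "art-house films"},
    "herbal tea": {"caffeine-free drinks"},
}

def generalize(stated: str) -> set[str]:
    """Extend a stated preference to related categories never mentioned."""
    return {stated} | RELATED.get(stated, set())

# A string-matching model stops at "indie films"; the benchmark asks whether
# models make the extra inferential step to, say, "experimental cinema".
print(sorted(generalize("indie films")))
```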


Implications — The Missing Layer in AI Assistants

For businesses building AI assistants, these findings carry an important lesson.

The core challenge of personalization is not the model itself.

It is system design.

Three architectural layers are becoming essential:

| Layer | Role |
|---|---|
| Memory layer | Store structured user preferences |
| Retrieval layer | Surface relevant signals during generation |
| Reasoning layer | Apply preferences to new situations |

Without these layers, LLMs behave like stateless chatbots pretending to be personal assistants.
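The three layers can be wired together in a few lines. This is a deliberately naive sketch: each component is a toy stand-in, where a real system would use model-based extraction, a vector database, and an LLM call.

```python
# Naive end-to-end sketch of the three layers; every component is a toy
# stand-in for what would be model-based in production.

def memory_layer(sessions: list[str]) -> list[str]:
    """Memory layer: keep utterances that look like preference statements."""
    return [s for s in sessions if "prefer" in s.lower() or "dislike" in s.lower()]

def retrieval_layer(memories: list[str], query: str) -> list[str]:
    """Retrieval layer: surface memories sharing words with the query."""
    query_words = set(query.lower().split())
    return [m for m in memories if query_words & set(m.lower().split())]

def reasoning_layer(relevant: list[str], query: str) -> str:
    """Reasoning layer: in practice an LLM call; here, prompt assembly."""
    context = "\n".join(f"- {m}" for m in relevant)
    return f"Known preferences:\n{context}\nQuestion: {query}"

sessions = [
    "I dislike crowded tourist attractions.",
    "The weather was nice yesterday.",
    "I prefer traditional ryokan stays in Japan.",
]
query = "Where should I stay in Japan?"
prompt = reasoning_layer(retrieval_layer(memory_layer(sessions), query), query)
print(prompt)
```

Even this toy pipeline makes the division of labor visible: the model no longer has to find the preference inside a long transcript, because the surrounding system has already isolated and surfaced it.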

This is precisely why many real-world AI systems rely on external memory systems, vector databases, and retrieval pipelines.

The frontier model alone is not enough.


Conclusion — Personalization Is Harder Than Intelligence

RealPref exposes an uncomfortable truth about today’s AI assistants.

They are impressively capable at answering questions.

They are far less capable at remembering people.

For AI to truly function as a personal assistant—rather than a clever search engine—it must master something that humans take for granted:

understanding preferences across time.

That requires improvements not just in model size or context windows, but in memory architectures, retrieval systems, and reasoning over user behavior.

Until then, your AI assistant may sound attentive.

But it probably still forgets what you like.


Cognaptus: Automate the Present, Incubate the Future.