Love in the Time of Context: Why LLMs Still Don't Get You
TL;DR for operators

Personalization does not fail because the model forgot your birthday. That would be almost charming. It fails because the system remembers too much in the wrong shape.

The Cupid benchmark tests whether LLMs can infer a user’s context-dependent preference from prior multi-turn interactions and apply it to a new request.[1] The setup is deliberately business-relevant: users do not announce a clean preference profile; they reveal expectations through feedback, correction, and mild conversational friction. Very realistic. Nobody fills out a YAML file called my_deeply_contextual_preferences.yml, at least not outside certain Slack channels. ...