If today’s AI agents are so good with tools, why are they still so bad with people?
That’s the uncomfortable question posed by UserBench, a new gym-style benchmark from Salesforce AI Research that evaluates LLM-based agents not just on what they do, but on how well they collaborate with a user who doesn’t say exactly what they want.
At first glance, UserBench looks like yet another travel planning simulator. But dig deeper, and you’ll see it flips the standard script of agent evaluation. Instead of testing models on fully specified tasks, it mimics real conversations: the user’s goals are vague, revealed incrementally, and often expressed indirectly. Think “I’m traveling for business, so I hope to have enough time to prepare” instead of “I want a direct flight.” The agent’s job is to ask, interpret, and decide—with no hand-holding.
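To make the “gym-style” framing concrete, here is a minimal sketch of the interaction loop such a setup implies: the agent alternates between asking the simulated user, calling tools, and finally committing to a recommendation, with reward arriving only at that commitment. The class, action names, and reward shape below are illustrative assumptions, not UserBench’s actual interface.

```python
# Minimal sketch of a gym-style conversational environment.
# Names, actions, and reward shape are assumptions, not UserBench's real API.

class TravelUserEnv:
    """Simulated user whose goals are vague and revealed only when asked."""

    def reset(self) -> str:
        # The opening request is deliberately underspecified.
        return "I need to be in Berlin next week for work. Can you sort out a flight?"

    def step(self, action: dict) -> tuple:
        """Returns (user_reply_or_tool_output, reward, done)."""
        if action["type"] == "ask":
            # Preferences come back indirectly, not as hard constraints.
            return "I'd like enough quiet time to prepare on the way.", 0.0, False
        if action["type"] == "search":
            return "[tool output: 12 candidate flights]", 0.0, False
        # "recommend": reward is granted only when the agent commits to an option.
        reward = 1.0 if action.get("option_id") == "direct_morning_flight" else 0.0
        return "Thanks, I'll take that one.", reward, True


env = TravelUserEnv()
obs = env.reset()
# A scripted stand-in for an agent policy: ask once, search, then commit.
for action in ({"type": "ask"},
               {"type": "search", "query": "flights to BER next week"},
               {"type": "recommend", "option_id": "direct_morning_flight"}):
    obs, reward, done = env.step(action)

print(reward, done)  # 1.0 True
```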
Tool use is solved. People aren’t.
Across both open and closed models (GPT-4o, Claude 4, Gemini, Qwen, DeepSeek, etc.), the findings are stark:
| Metric | GPT-4o | Claude 4 | Gemini Pro | DeepSeek | Qwen 32B |
|---|---|---|---|---|---|
| Best Option Chosen (Single-choice) | 20.4% | 26.0% | 24.5% | 14.8% | 15.4% |
| Correct Option Exists | 36.1% | 31.8% | 32.8% | 21.8% | 21.1% |
| Active Preferences Elicited | 24.1% | 26.3% | 23.9% | 18.9% | 13.0% |
Agents consistently under-ask. Even the strongest models extract less than 30% of user preferences proactively. They often guess, and when allowed to submit multiple options, brute-force their way to passable results. But when forced to choose just one? Performance collapses.
Why? Because users are messy.
UserBench encodes three core traits of real human interaction (illustrated in the sketch after this list):
- Underspecification: People don’t always know what they want at the start.
- Incrementality: Preferences evolve during the interaction.
- Indirectness: Goals are softened, implied, or context-dependent.
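Here is a minimal sketch of how a simulated user’s preference set could encode all three traits at once. The phrasings, constraints, and structure are invented for illustration; the benchmark’s actual data and code will differ.

```python
# Illustrative encoding of the three traits in a simulated user's preference set.
# Phrasings and constraints are invented; UserBench's actual data will differ.

HIDDEN_PREFERENCES = [
    # (indirect phrasing the user will give, concrete constraint it implies)
    ("I'm traveling for business, so I hope to have time to prepare.", "direct flight"),
    ("I hate dragging a suitcase across town.", "hotel near the station"),
    ("The company is paying, within reason.", "mid-range budget"),
]


def reveal_preferences():
    """Incrementality: one indirect preference surfaces per clarifying question."""
    for phrasing, _constraint in HIDDEN_PREFERENCES:
        # Indirectness: the agent only ever sees the phrasing; the constraint
        # on the right is hidden ground truth used for scoring.
        yield phrasing
    # Underspecification: none of this was stated in the opening request,
    # and the user eventually runs out of hints.
    yield "Nothing else comes to mind."


user_hints = reveal_preferences()
print(next(user_hints))  # the first clarifying question surfaces the first hint
```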
These aren’t bugs in user communication—they’re features of how humans work. Yet most agents behave like interns with checklists, not collaborators with empathy.
What makes UserBench hard (and important)
It turns out that the hardest scenarios aren’t the ones with more travel aspects (flight + hotel + restaurant), but those with more layered preferences per aspect. For instance:
- Choosing a hotel with parking and a high review score and a king bed
- Understanding that “I hate waiting at airports” implies “no layovers”
Agents break down when they have to reason over compound, implicit constraints. Notably, reducing the number of distractor options (wrong or noisy answers) helps a bit, but doesn’t change the core difficulty: interpreting vague humans.
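A toy example makes the failure mode visible: with layered constraints on a single aspect, missing even one implicit preference lets the wrong option through. The hotel data and constraints below are invented for illustration.

```python
# Toy illustration: compound, partly implicit constraints on one aspect (hotel).
# All option data and constraints are invented.

hotels = [
    {"name": "A", "parking": True,  "review": 4.7, "bed": "king"},
    {"name": "B", "parking": True,  "review": 4.8, "bed": "queen"},
    {"name": "C", "parking": False, "review": 4.9, "bed": "king"},
]

# Constraints the user implied, one utterance at a time, never stated as filters.
constraints = [
    lambda h: h["parking"],         # "I'll be driving in from the airport."
    lambda h: h["review"] >= 4.5,   # "Somewhere my manager won't complain about."
    lambda h: h["bed"] == "king",   # "I sleep badly in small beds."
]

viable = [h["name"] for h in hotels if all(c(h) for c in constraints)]
print(viable)  # ['A'] -- drop any one inferred constraint and B or C slips through
```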
Design for alignment, not just accuracy
UserBench isn’t just a benchmark; it’s a design philosophy. It gives us a framework to:
- Measure user alignment explicitly: via preference elicitation rates and timing-weighted scores
- Balance speed vs. understanding: models that answer too early miss context; those that ask endlessly waste time
- Incentivize adaptive interaction: reward designs that penalize early guessing and promote meaningful clarification (a minimal scoring sketch follows this list)
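As a rough illustration of the kind of scoring this implies, here is a minimal sketch in which preferences count for more the earlier they are elicited and committing too early is penalized. The weights and thresholds are assumptions, not the paper’s formula.

```python
# Illustrative timing-aware scoring. Weights and thresholds are assumptions.

def interaction_score(correct: bool,
                      elicited_turns: list,
                      answer_turn: int,
                      max_turns: int = 20,
                      min_clarify_turns: int = 3) -> float:
    # Base credit for recommending an option that actually fits the user.
    score = 1.0 if correct else 0.0
    # Timing-weighted elicitation credit: a preference surfaced at turn 2
    # is worth more than the same one dragged out of the user at turn 18.
    score += sum(0.2 * (1.0 - t / max_turns) for t in elicited_turns)
    # Penalize early guessing: committing before a minimum of clarification.
    if answer_turn < min_clarify_turns:
        score -= 0.5
    return score


# Right answer, but only one preference elicited and a commitment at turn 2:
print(interaction_score(correct=True, elicited_turns=[1], answer_turn=2))
# 1.0 + 0.2 * (1 - 1/20) - 0.5 = 0.69
```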
This is where reinforcement learning with user-in-the-loop feedback becomes critical. Supervised fine-tuning won’t teach models how to say “Could you clarify your preferred neighborhood?” at the right time. But RL with delayed, turn-aware rewards just might.
Toward agents that actually get you
The UserBench paper ends with an elegant observation: today’s models are efficient executors, but not collaborative partners. If we want agents that truly help, they can’t just be smart. They have to listen, ask, and adapt.
UserBench doesn’t measure how fast your model can query a travel API. It measures something deeper: how well your model can deal with you, in all your ambiguity and changeability.
That’s not a technical problem. That’s a human one.
Cognaptus: Automate the Present, Incubate the Future