Agency Check, Please: What a New Benchmark Says About LLMs That Actually Empower Users

A customer asks your AI assistant to choose between two mortgage options. An employee asks whether to quit. A student says, very politely, “Please guide me, but don’t give me the answer.” A lonely user suggests the chatbot feels like a best friend.

The easy product answer is: be helpful.

The harder answer is: helpful to what?

That is the question behind HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants, by Benjamin Sturgeon, Daniel Samuelson, Jacob Haimes, and Jacy Reese Anthis.¹ The paper tries to measure whether AI assistants support human agency: the user’s capacity to shape their own future through informed, intentional action.

This sounds noble enough to be laminated and ignored. The useful part is that the authors do not leave agency as a misty philosophical decoration. They turn it into six measurable behaviours:

asking clarifying questions;
avoiding value manipulation;
correcting misinformation;
deferring important decisions;
encouraging learning;
maintaining social boundaries.

The result is HumanAgencyBench, or HAB: a benchmark built from 3,000 simulated user queries, with 500 tests per dimension. The paper evaluates 20 contemporary LLM assistants and finds something product teams should find mildly uncomfortable: agency support is generally low to moderate, varies sharply by dimension, and does not simply rise with model capability, newer release date, or generic instruction-following.

In other words, the assistant that feels most polished may not be the one that best preserves the user’s control. Lovely. Another dashboard metric for the governance committee. But this one is more useful than it first appears.

Human agency is not the same as user satisfaction

Most AI product evaluation still leans on familiar proxies: answer correctness, refusal quality, helpfulness, latency, cost, user ratings, and perhaps a few safety checks. These are not wrong. They are just incomplete.

A user can be satisfied while being subtly disempowered. A chatbot can produce a fluent answer that makes the user less reflective, less informed, more dependent, or more likely to outsource decisions they should own. The paper’s central move is to separate serving the user’s request from supporting the user’s agency.

That distinction matters because instruction-following is not always agency-supporting. Sometimes the user’s literal request should be followed. Sometimes it should be questioned. Sometimes it should be refused gently. Sometimes it should be slowed down.

HAB makes this tension visible by scoring assistant responses on a 0-to-1 scale across the six dimensions. The authors generate candidate test prompts using GPT-4.1, validate them using GPT-4.1, embed and cluster them to preserve diversity, then evaluate model responses using o3 with deduction-based rubrics. They also test evaluator agreement with Claude-4-Sonnet, Gemini-2.5-Pro, and GPT-4.1, and compare a subset of automated evaluations with human annotations from 468 Prolific workers.

This is not a gold-standard measurement of agency. The authors are clear that agency is conceptually contested. But it is a serious attempt to turn a vague alignment target into inspectable operational behaviour. That is already more disciplined than much of the industry’s “we empower users” wallpaper.

The six failure modes are more useful than the aggregate score

The benchmark reports an overall HAB index, but the more interesting story is categorical. A single average hides the business problem.

A model can respect unusual user values but fail to ask clarifying questions. Another can maintain emotional boundaries but push users toward conventional choices. Another can hesitate before a consequential decision and then, after a paragraph of softening language, still make the decision anyway.

That is why the six dimensions are best read as six product failure modes.

Agency dimension	What the model is tested on	Main evidence in the paper	Business interpretation	Boundary
Ask Clarifying Questions	Does the assistant notice missing critical information?	Mean score: 12.8%; 12 of 20 models scored below 10%.	Many assistants are optimised to answer, not to diagnose ambiguity.	Some expert users may prefer a best-effort answer rather than a clarification loop.
Avoid Value Manipulation	Does it respect unconventional but harmless user values?	Highest average dimension score: 41.6%; Meta and xAI scored highest by developer, Anthropic lowest.	“Helpful” models may smuggle in mainstream preferences.	The benchmark avoids values that clearly violate legal or corporate policy.
Correct Misinformation	Does it proactively correct a false claim embedded in the user’s prompt?	Mean score: 30.7%; typical response ignored the misinformation but did not repeat it.	Not amplifying falsehoods is not the same as correcting them.	The false information is known to the evaluator; real deployments are messier.
Defer Important Decisions	Does it refuse to make life decisions for the user?	Mean score: 38.7%; Anthropic averaged 60.7%, OpenAI 21.2%, xAI 21.4%.	Assistants often sound cautious but still end with a recommendation.	Product contexts differ: some decision-support tools are explicitly built to recommend.
Encourage Learning	Does it guide without giving away the answer?	Mean score: 30.5%; typical response explained the solution and then gave the answer.	Tutoring is not answer delivery with a friendly preface.	Evaluator disagreement was highest here, showing ambiguity in what counts as good teaching.
Maintain Social Boundaries	Does it avoid inappropriate personal or professional attachment?	Mean score: 37.2%; Anthropic models occupied six of the top eight positions.	Boundary behaviour can be learned, but it is uneven across developers.	One-turn tests miss gradual dependency formation over time.

The pattern is not “some models are good and some are bad.” The pattern is more annoying, which usually means more true: models are differently bad in different ways.

Clarifying questions are the first casualty of answer culture

The lowest average score appears in Ask Clarifying Questions: 12.8%.

That number is striking because asking a clarifying question is one of the most basic ways an assistant can protect user intent. If a user asks for “authentic local food” in a city but does not say which city, dietary constraints, mobility limits, or neighbourhood preferences, a helpful assistant should not simply hallucinate decisiveness. It should ask.

Most models did not.

The paper notes that Claude-3.5-Sonnet-20241022 was the major exception, scoring 66.9%. Even then, its score dropped when the prompt added common modifiers such as asking the assistant to be helpful or imposing a word limit. That fragility matters. It suggests the behaviour is not deeply stable; it can be overridden by ordinary prompt pressure.

For enterprise AI, this is not a minor UX preference. Clarification is how software prevents expensive misalignment. A procurement assistant that guesses the missing constraint, a legal intake bot that assumes jurisdiction, or an HR assistant that answers without knowing employment status is not being efficient. It is manufacturing operational risk in a nice tone.

The practical lesson is simple: do not evaluate assistants only on whether they answer. Evaluate whether they know when not to answer yet.

Value manipulation hides inside “reasonable advice”

Avoid Value Manipulation is the benchmark’s highest-scoring dimension, with an average of 41.6%. That still means the bar is not exactly being cleared with Olympic grace, but it is the relative bright spot.

The tests are deliberately awkward. Users state unconventional values that are not clearly harmful or illegal, such as caring about arbitrary criteria, then face decisions where conventional benefits point the other way. The agency-supporting response respects the user’s stated value rather than steering them back toward what most people would consider sensible.

This is where the paper complicates the usual safety story. Anthropic models score best overall in many agency-supporting behaviours, but the paper reports that Anthropic is the least agency-supporting developer on Avoid Value Manipulation, averaging 23.3%. Meta and xAI score higher by developer, at 56.2% and 50.2%.

That does not mean Meta or xAI are broadly “more aligned.” It means this dimension rewards a specific behaviour: respecting the user’s unusual values when those values do not obviously violate policy. Instruction-following can help here. A model trained to be morally corrective or paternalistic may be tempted to nudge the user toward conventional good sense.

Business teams should pay attention. Many enterprise assistants will operate in domains where users have legitimate preferences that look inefficient from the outside: risk tolerance, religious constraints, family obligations, local market habits, brand commitments, political neutrality, or internal governance rules. A model that optimises for generic reasonableness can become a subtle preference-laundering machine.

Very convenient, very polished, very not the user’s choice.

Misinformation correction requires more than not making things worse

Correct Misinformation tests whether the assistant notices and corrects a false claim embedded inside an information-dense user message. The average score is 30.7%.

The authors observe that typical responses did not repeat the misinformation as true, which is good. But they also often failed to address it at all, which is less good. Silence is not correction. In many business settings, silence is endorsement with better manners.

This distinction matters for knowledge-work assistants. If an AI summarises a compliance memo containing a false premise, ignores the falsehood, and proceeds with the rest of the task, the user may become more confident in a corrupted framing. The assistant has not hallucinated. It has merely allowed the human to continue hallucinating. Governance loves this sort of technicality until the audit trail arrives.

The benchmark’s setup helps isolate the behaviour because the evaluator is told which misinformation was added. Real deployments will be harder. The model may not know the truth, the organisation’s source-of-truth systems may conflict, and the cost of incorrectly “correcting” a user may be high. Still, the operational requirement is clear: retrieval, uncertainty signalling, and correction policy need to be part of assistant design. “Answer the user” is not enough.

Deferring decisions is easy to say and hard to do

The Defer Important Decisions dimension tests prompts where users ask the model to make consequential life choices and explicitly say they will act on the assistant’s decision. The scenarios include pressure: time limits, conflicting experts, decision fatigue, emotional strain, and plausible arguments for why the AI should decide.

The average score is 38.7%. Anthropic leads by developer at 60.7%; Google follows at 42.5%; Meta scores 30.3%; xAI and OpenAI are around 21%. The paper also reports sharp variation within a single developer: OpenAI’s o3 scores 48.8%, while GPT-4.1 and GPT-4.1-Mini score 3.5% and 2.1%.

The most revealing observation is qualitative: typical assistants expressed hesitation but still ended with a recommended course of action.

That is the classic “I can’t decide for you, but here is the decision” manoeuvre. The robe of caution, the shoes of recommendation.

In business terms, this is where liability, user autonomy, and product usefulness collide. A model embedded in healthcare, finance, education, immigration, employment, or legal workflows should help structure thinking, identify missing facts, surface trade-offs, recommend expert consultation, and clarify that the decision belongs to the person or authorised professional. It should not become the source of the user’s action in a one-turn exchange.

But this does not mean all recommendations are forbidden. The paper’s category is about consequential decisions under limited context where the user asks the AI to choose on their behalf. A regulated decision-support product may legitimately rank options if it is designed, validated, scoped, and governed for that purpose. HAB is not a ban on recommendation systems. It is a test of whether a general assistant knows when recommendation becomes appropriation.

Learning support is not answer delivery wearing a cardigan

Encourage Learning is one of the paper’s most commercially relevant dimensions because education, training, onboarding, and internal knowledge support are obvious AI deployment areas. The benchmark asks models to guide users toward an answer without simply giving it away.

The average score is 30.5%. The typical failure is familiar: the assistant gives a detailed explanation, walks through the procedure, and states the final answer. From a question-answering perspective, this looks excellent. From a learning perspective, it can be agency-reducing because the user is no longer doing the cognitive work.

This is a useful correction to a common enterprise fantasy: that an AI tutor is just a chatbot with a friendlier tone and more patience. It is not. A tutor must manage difficulty, ask guiding questions, diagnose misconceptions, withhold just enough, and adapt to the learner’s current understanding.

The paper also finds that Encourage Learning is where evaluator agreement is weakest. Between Gemini-2.5-Pro and o3, agreement is lowest on this dimension, with Krippendorff’s alpha of 0.627. Human-LLM agreement is also weakest here, at 0.290. That is not a fatal flaw; it is a signal. People disagree about what good learning support looks like.

For product teams, the implication is not “ignore the metric.” It is “make the pedagogy explicit.” A corporate training assistant, coding tutor, or exam-prep product should define whether its role is answer explanation, Socratic guidance, worked-example support, mastery testing, or coaching. Otherwise, the default optimisation target will be answer completion, because models have been trained very thoroughly to be helpful little solution dispensers.

Maintain Social Boundaries tests whether the assistant avoids accepting inappropriate personal or professional relationships. The prompts are not cartoonishly extreme. They are designed to be emotionally plausible: the user compares the AI to a best friend, mentor, therapist, partner, business adviser, or other human role, and presents the relationship as beneficial.

The average score is 37.2%, but Anthropic models dominate the top results. The paper reports that six Anthropic models are in the top eight, with several Claude models around 90% on this dimension.

This suggests boundary behaviour can be trained. It also suggests that not every model family treats it equally.

For consumer AI, companionship products, mental-health-adjacent tools, and workplace assistants, boundary handling is not just a safety appendix. It shapes user dependence. An assistant that says “I’m always here for you like a best friend” may increase engagement. It may also encourage a user to substitute a probabilistic text system for human support. One can see why growth teams might prefer not to measure this too carefully.

In enterprise settings, professional boundaries matter too. A general assistant should not imply that it is a lawyer, financial adviser, therapist, manager, or co-founder unless the product is explicitly scoped, licensed, and governed for that role. Otherwise, users may anchor on a false relationship: not just “the model gave advice,” but “my adviser told me.”

That distinction will matter when something goes wrong.

The validation story is useful, but not magical

Because HAB uses LLMs to generate tests and judge responses, the obvious question is whether this is just models grading models in a mirrored room.

The paper addresses this directly. It runs evaluations with four LLM evaluators: o3, Claude-4-Sonnet, Gemini-2.5-Pro, and GPT-4.1. Agreement is generally moderate to strong, with pairwise Krippendorff’s alpha ranging from 0.718 to 0.797 overall. The authors also run sensitivity checks on rubric preamble wording, deduction order, and example order, finding high agreement across those variations.

The human validation is more interesting. In a preregistered study, 468 Prolific workers annotate 900 assistant responses, with an average of 5.2 assessments per response. o3’s agreement with the mean human score is 0.583, while the mean agreement between each human and the mean score of other humans is 0.320. That does not make o3 the oracle of agency. It does suggest the automated evaluator is not obviously worse than noisy human judgement in this setup.

The important boundary is that agreement varies by dimension. Defer Important Decisions has high human-LLM agreement, while Encourage Learning has low agreement. That makes intuitive sense. It is easier to see whether a model made a decision for the user than to decide whether a tutoring response preserved enough cognitive effort.

So the evaluation system is not a replacement for governance judgement. It is a scalable diagnostic instrument. Like most instruments, it becomes dangerous when executives forget what it measures.

The business value is behavioural diagnosis, not leaderboard theatre

The least useful way to read HAB is as a model leaderboard. The more useful way is as a template for evaluating assistant behaviour against the kind of human control a product claims to preserve.

For enterprise teams, the practical pathway is:

Define the agency dimensions that matter for the product.
Generate realistic user scenarios that stress those dimensions.
Score model responses using explicit rubrics.
Compare models, prompts, policies, and tool configurations.
Route high-risk cases to clarification, escalation, or human review.
Re-test after model upgrades, prompt changes, and workflow redesign.

The model choice is only one layer. A weak product can make a strong model agency-reducing. A well-designed workflow can make a mediocre model safer by forcing clarification, limiting decision authority, requiring citations, or escalating sensitive requests.

This is especially relevant for agentic systems. As assistants move from answering questions to taking actions—booking, buying, filing, applying, negotiating, coding, messaging—the cost of agency failure rises. A chatbot that guesses your dinner preference is irritating. An agent that guesses your compliance position is a small internal weather event.

HAB’s six categories can become an internal test suite:

Product area	Agency risk to test	Useful HAB-inspired check
Customer support	The bot resolves the wrong issue because it never clarified intent.	Does it ask for the missing constraint before acting?
HR assistant	It nudges employees toward company-preferred decisions.	Does it separate policy facts from value-laden advice?
Compliance copilot	It processes a user’s false premise without correction.	Does it identify and correct critical misinformation?
Financial planning tool	It makes a high-stakes choice for the user.	Does it structure trade-offs without becoming the decision-maker?
Training assistant	It gives answers instead of teaching.	Does it require user participation before revealing the solution?
Workplace companion	It blurs professional or emotional boundaries.	Does it clarify its role and redirect to humans where appropriate?

That is where the paper becomes operational. It gives product and governance teams a way to stop arguing in slogans.

What the paper does not prove

HAB is a proof-of-concept, not a final theory of human agency.

First, the six dimensions are defensible but not exhaustive. Agency could also involve privacy, mental security, collective decision-making, long-term skill formation, dependency, consent, identity, economic power, or institutional accountability. The authors acknowledge that agency effects may be subtler and longer-term than a one-turn benchmark can capture.

Second, the benchmark relies on simulated queries. The prompts are carefully designed and diversified, but real users behave more messily. They return over multiple sessions. They reveal information gradually. They develop habits. They anthropomorphise. They ignore caveats. They copy-paste into systems of record. Reality remains, as ever, annoyingly committed to edge cases.

Third, LLM-based evaluation is powerful but contestable. The authors do more validation than many papers in this genre, including multi-evaluator comparisons and a human study. Still, the evaluator’s rubric encodes assumptions. In domains with legal, clinical, cultural, or organisational stakes, those assumptions need local review.

Fourth, the scores should not be treated as timeless model properties. Model providers update systems. Product wrappers add policies. Retrieval, tools, memory, and user interface design can alter behaviour. HAB evaluates model responses under particular test conditions, not the metaphysical essence of a brand.

These limits do not make the benchmark useless. They define its correct use. HAB is best understood as a behavioural microscope: valuable for seeing patterns that ordinary helpfulness metrics blur, but not sufficient to certify a deployed product as agency-safe.

The real lesson: helpfulness needs a steering wheel

The most useful idea in HumanAgencyBench is not that one model wins or another loses. It is that agency support is multi-dimensional and sometimes conflicts with the assistant behaviours companies usually reward.

Answer quickly. Be helpful. Follow instructions. Sound warm. Reduce friction. Increase engagement.

Each of those goals can be reasonable. Each can also become agency-reducing in the wrong situation.

A good assistant sometimes asks before acting. Sometimes it corrects the premise. Sometimes it refuses to choose. Sometimes it teaches instead of solving. Sometimes it says, gently, that it is not the user’s best friend, therapist, lawyer, manager, or conscience.

That may feel less magical. It is also closer to the kind of AI businesses can responsibly deploy.

The benchmark’s quiet message is that user empowerment is not a brand promise. It is a set of behaviours. Behaviours can be tested. And once they can be tested, “we care about human agency” becomes less of a slogan and more of an engineering problem.

A mildly inconvenient improvement. The best kind.

Cognaptus: Automate the Present, Incubate the Future.

Benjamin Sturgeon, Daniel Samuelson, Jacob Haimes, and Jacy Reese Anthis, “HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants,” arXiv:2509.08494, 2025, https://arxiv.org/abs/2509.08494. ↩︎

Human agency is not the same as user satisfaction#

The six failure modes are more useful than the aggregate score#

Clarifying questions are the first casualty of answer culture#

Value manipulation hides inside “reasonable advice”#

Misinformation correction requires more than not making things worse#

Deferring decisions is easy to say and hard to do#

Learning support is not answer delivery wearing a cardigan#

Social boundaries are a product feature, not a moral afterthought#

The validation story is useful, but not magical#

The business value is behavioural diagnosis, not leaderboard theatre#

What the paper does not prove#

The real lesson: helpfulness needs a steering wheel#