Opening — Why this matters now
As large language models quietly migrate from text generators to decision makers, the industry has developed an unhealthy obsession with the wrong question: Did the model choose the same option as a human? Accuracy, F1, and distributional overlap have become the default proxies for alignment.
They are also deeply misleading.
When decisions are constrained — by time, money, or capacity — matching outcomes tells us very little about how a choice was made. Two agents can land on the same allocation for entirely different reasons, and those reasons are precisely what determine whether the model will generalize, drift, or fail under new conditions.
The paper behind XCHOICE makes an uncomfortable but necessary claim: alignment should be evaluated at the level of decision mechanisms, not surface agreement.
Background — Context and prior art
In economics, this debate is decades old. Reduced-form correlations collapse under policy changes — a problem formalized in the Lucas critique. Structural models, by contrast, aim to recover stable preference parameters and trade-offs that survive environmental shifts.
AI evaluation, however, has largely ignored this lesson. Most alignment benchmarks ask whether LLMs replicate observed human choices, not whether they encode similar priorities, constraints, or substitution patterns. In constrained settings — scheduling, budgeting, time allocation — this omission is fatal.
An LLM can respect a budget while misinterpreting why the budget matters. That difference rarely shows up in accuracy metrics — until deployment.
Analysis — What XCHOICE actually does
XCHOICE reframes alignment as a mechanism-comparison problem.
Instead of comparing outcomes, it fits the same constrained optimization model to both human decisions and LLM-generated decisions. Using inverse optimization, it recovers interpretable parameters that represent:
- Relative importance of decision attributes
- Sensitivity to constraints
- Implied trade-offs across options
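As a concrete sketch (a toy specification, not the paper's exact model), think of these parameters as weights $w_a$ in a log-utility time-allocation problem over activities $a = 1, \dots, A$ with a fixed daily budget:

$$
\max_{t_1,\dots,t_A} \; \sum_{a=1}^{A} w_a \log t_a
\qquad \text{s.t.} \qquad \sum_{a=1}^{A} t_a = 1440, \quad t_a \ge 0.
$$

In this toy case the first-order conditions give $t_a^{*} = \frac{w_a}{\sum_b w_b} \cdot 1440$, so observed time shares pin down the relative weights; inverse optimization plays the same role for the richer utility and constraint structures the paper actually estimates.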
Alignment is then evaluated by comparing parameter vectors — not decisions — across models, activities, and demographic subgroups.
Conceptually, the framework follows three steps:
- Model decisions structurally as constrained optimization problems
- Estimate latent preference weights from observed (human) and generated (LLM) choices
- Diagnose misalignment via parameter divergence, invariance tests, and subgroup analysis
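A minimal Python sketch of these three steps, reusing the toy log-utility model above (the closed-form estimator, the L1 divergence, and all numbers are illustrative assumptions, not the paper's estimator):

```python
import numpy as np

ACTIVITIES = ["work", "leisure", "sleep_personal_care", "other"]
DAY_MINUTES = 1440

def estimate_weights(allocations: np.ndarray) -> np.ndarray:
    """Step 2: recover latent preference weights from observed allocations.

    Under the toy log-utility model, optimal minutes are proportional to the
    weights, so average time shares identify the normalized weights.
    `allocations` has shape (n_people, n_activities), rows summing to 1440.
    """
    shares = allocations / DAY_MINUTES
    weights = shares.mean(axis=0)
    return weights / weights.sum()

def mechanism_divergence(w_human: np.ndarray, w_llm: np.ndarray) -> float:
    """Step 3: diagnose misalignment as distance between parameter vectors."""
    return float(np.abs(w_human - w_llm).sum())

# Toy data: minutes per activity for two survey respondents and for two
# LLM-generated schedules conditioned on matching demographic profiles.
human_minutes = np.array([[480, 180, 600, 180],
                          [450, 240, 570, 180]], dtype=float)
llm_minutes = np.array([[510, 300, 480, 150],
                        [540, 270, 480, 150]], dtype=float)

w_human = estimate_weights(human_minutes)  # Steps 1-2 on observed choices
w_llm = estimate_weights(llm_minutes)      # Steps 1-2 on generated choices
print(dict(zip(ACTIVITIES, np.round(w_human, 3))))
print(dict(zip(ACTIVITIES, np.round(w_llm, 3))))
print("mechanism divergence:", round(mechanism_divergence(w_human, w_llm), 3))
```

In the paper, estimation runs via inverse optimization across activities and demographic subgroups rather than pooled shares; the sketch only shows the shape of the pipeline.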
The result is not just a score, but a diagnostic instrument.
Findings — Results that accuracy metrics hide
Case study: Daily time allocation
Using the American Time Use Survey as human ground truth, the authors model how individuals allocate 1,440 minutes across work, leisure, sleep/personal care, and other activities.
When LLMs are prompted with realistic demographic profiles, they produce plausible schedules. Outcome-wise, nothing looks obviously broken.
Mechanism-wise, the cracks are immediate.
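To make the setup concrete, the generation side looks roughly like this (the prompt wording and the `query_llm` call are hypothetical placeholders, not the paper's protocol):

```python
import json

def build_prompt(profile: dict) -> str:
    """Ask the model to allocate a 1,440-minute day for a given demographic profile."""
    return (
        f"You are simulating a person with this profile: {json.dumps(profile)}. "
        "Allocate their typical weekday (1440 minutes total) across the activities "
        "work, leisure, sleep_personal_care, other. "
        "Respond only with JSON mapping each activity to minutes."
    )

def parse_schedule(response: str) -> dict:
    """Parse the generated schedule and check that the time budget is respected."""
    schedule = json.loads(response)
    assert sum(schedule.values()) == 1440, "generated schedule breaks the time budget"
    return schedule

profile = {"age": 34, "sex": "female", "employed": True, "spouse_present": True}
prompt = build_prompt(profile)
print(prompt)
# response = query_llm(prompt)         # hypothetical LLM call
# schedule = parse_schedule(response)  # minutes per activity, ready for estimation
```

The parsed schedules feed the same structural estimator as the survey records, which is what makes the mechanism comparison apples-to-apples.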
Model-level alignment (simplified)
| Model | Overall Mechanism Alignment |
|---|---|
| Claude-3.7 | High, stable across tasks |
| GPT-4o | Moderate, activity-dependent |
| DeepSeek-V3 | Mixed, sign reversals |
| Llama-3.3 | Low |
| Qwen-2.5 | Lowest |
Some models match humans on what they choose while fundamentally disagreeing on why.
Attribute-level misalignment
Two attributes dominate divergence:
- Race: Black
- Spouse present
In the human data, both attributes significantly reshape time-allocation trade-offs. Most LLMs either attenuate or outright reverse these effects.
This is not noise. It is systematic reweighting.
Accuracy metrics never flag this — because the model can still land on reasonable-looking averages while erasing subgroup-specific structure.
Robustness under distribution shifts
XCHOICE parameters remain stable under mild shifts in income, age composition, race mix, and marital status. Reduced-form regressions do not.
This matters operationally: mechanism-level alignment is what determines whether a system behaves sensibly when context changes.
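What such an invariance check can look like, as a sketch reusing the toy estimator above (the reweighting scheme, the stability threshold, and all numbers are illustrative assumptions):

```python
import numpy as np

def reweighted_estimate(allocations: np.ndarray, ages: np.ndarray,
                        shift_toward_older: float) -> np.ndarray:
    """Re-estimate preference weights under a simulated shift in age composition.

    Older respondents are up-weighted, mimicking a mild change in the
    population mix the system would face after deployment.
    """
    sample_weights = 1.0 + shift_toward_older * (ages > 50)
    shares = allocations / 1440
    weights = np.average(shares, axis=0, weights=sample_weights)
    return weights / weights.sum()

def is_invariant(w_base: np.ndarray, w_shifted: np.ndarray, tol: float = 0.05) -> bool:
    """Mechanism-level stability: parameters barely move when the mix changes."""
    return bool(np.max(np.abs(w_base - w_shifted)) < tol)

# allocations: (n_people, n_activities) minutes; ages: (n_people,) years
allocations = np.array([[480, 180, 600, 180],
                        [300, 300, 660, 180]], dtype=float)
ages = np.array([34, 67])

w_base = reweighted_estimate(allocations, ages, shift_toward_older=0.0)
w_shift = reweighted_estimate(allocations, ages, shift_toward_older=0.5)
print("stable under age-composition shift:", is_invariant(w_base, w_shift))
```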
Implications — What this means for AI systems
Three uncomfortable conclusions emerge:
1. Outcome agreement is insufficient. You can pass benchmarks while encoding the wrong priorities.
2. Misalignment is heterogeneous. It concentrates in specific attributes and subgroups, not uniformly across the population.
3. Interventions must be targeted. Blind fine-tuning is inefficient. XCHOICE shows where to intervene.
The paper demonstrates this by applying a retrieval-augmented generation (RAG) intervention. When grounded with domain-specific evidence, some models shift their inferred trade-offs closer to human baselines — but only where genuine gaps exist. Where alignment was already strong, RAG sometimes worsened it.
Alignment, it turns out, does not improve monotonically with intervention.
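As a sketch of what "targeted" means operationally (the attribute names follow the paper's findings; the gap values, threshold, and retrieval helpers are hypothetical placeholders):

```python
# Per-attribute mechanism gaps from the diagnostic step (placeholder values).
attribute_gap = {
    "race_black": 0.42,      # large divergence between human and LLM weights
    "spouse_present": 0.35,  # large divergence between human and LLM weights
    "age": 0.04,             # already aligned: leave alone
    "income": 0.06,          # already aligned: leave alone
}

GAP_THRESHOLD = 0.15  # intervene only where a genuine mechanism gap exists

targets = [attr for attr, gap in attribute_gap.items() if gap > GAP_THRESHOLD]
print("attributes to ground with retrieved evidence:", targets)

# for attr in targets:
#     evidence = retrieve_domain_evidence(attr)       # hypothetical retrieval step
#     prompt = augment_prompt(base_prompt, evidence)  # hypothetical prompt grounding
```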
Conclusion — Alignment is a structure problem
XCHOICE does not offer a new benchmark. It offers something more useful: a way to see what alignment actually consists of.
If LLMs are going to act as planners, advisors, and simulators of human behavior, we need to stop asking whether they agree with us — and start asking whether they reason like us under constraints.
Surface agreement is cheap. Mechanism alignment is not.
Cognaptus: Automate the Present, Incubate the Future.