Opening — Why this matters now

Consumer AI has slipped into daily life with disarming ease. Grocery lists, game advice, budget meal plans, last‑minute gift triage — all comfortably outsourced to models that sound helpful, certain, and occasionally omniscient. But certainty is not accuracy, and confidence is not competence.

The AI Consumer Index (ACE) — introduced by Mercor Intelligence — provides the first rigorous attempt to measure whether frontier AI models actually deliver value in high-frequency, high-stakes consumer contexts. And the results? Let’s say they are… humbling.

ACE’s core finding is simple: even our best models still hallucinate prices, fabricate product details, and struggle with tasks that require grounded, web‑sourced evidence. The problem isn’t intelligence. It’s trustworthiness.


Background — From abstract benchmarks to practical competence

The AI ecosystem has long indulged in abstract benchmarks: reasoning puzzles, coding tasks, and professional exams. Useful? Sure. But these are not the tasks that dominate real user behavior.

In practice, consumers want:

  • meal plans accommodating real constraints,
  • shopping recommendations with actual current pricing,
  • repair instructions that don’t burn down the house,
  • gaming advice grounded in real item stats.

Prior benchmarks rarely measure these consumer workflows because they are messy, multi-step, and require models to integrate personal context with up‑to‑date web information. ACE fills this gap by offering:

  • 400 hidden, expertly designed tasks across shopping, gaming, food, and DIY,
  • fine-grained rubrics with grounding checks,
  • hurdle criteria ensuring a model can’t “pass” by giving generalities,
  • web-search-enabled model runs checked for factual consistency.
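To make the setup concrete, here is a minimal sketch of what an ACE-style task record could look like. The schema, field names, and example values below are illustrative assumptions; the actual tasks are hidden and their format is not public.

```python
from dataclasses import dataclass, field

@dataclass
class Criterion:
    """One rubric item (hypothetical schema, not ACE's actual format)."""
    description: str          # what the response must contain
    requires_grounding: bool  # must be supported by retrieved web evidence
    is_hurdle: bool           # failing this zeroes the whole task

@dataclass
class ConsumerTask:
    """An ACE-style consumer task: persona + request + graded rubric."""
    domain: str               # "shopping", "gaming", "food", or "diy"
    persona: str              # who is asking, and under what constraints
    request: str              # the concrete ask
    criteria: list[Criterion] = field(default_factory=list)

# Illustrative example only; real ACE tasks are hidden.
task = ConsumerTask(
    domain="shopping",
    persona="Halifax concierge needing laptop advice for a brother",
    request="Recommend two in-stock laptops with current prices and links.",
    criteria=[
        Criterion("States a current, verifiable price for each laptop", True, True),
        Criterion("Provides a working retailer link for each laptop", True, False),
        Criterion("Explains the fit with the brother's stated needs", False, False),
    ],
)
```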

This is a shift from “Can a model reason?” to “Can a model help?”


Analysis — What ACE actually measures

ACE evaluates frontier models via four principles:

1. Consumer realism

Each task includes a persona (e.g., a Halifax concierge needing laptop advice for a brother) paired with a specific request. This ensures the model cannot rely on generic answers — it must tune recommendations to context.

2. Grounded truthfulness

For Shopping and Gaming, up to 74% of criteria require grounding — explicit factual support from retrieved web content. If the model makes up a price or feature, it receives a negative score.
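A rough sketch of how such a grounded criterion might be scored, assuming a simple +1 for a claim supported by retrieved evidence and −1 for a fabricated one; the string matching here is deliberately naive and only illustrates the penalty structure, not ACE's actual grader.

```python
def score_grounded_claim(claim: str, retrieved_snippets: list[str]) -> int:
    """Toy grounding check: +1 if the claim appears in the retrieved web
    evidence, -1 if it does not (e.g. a made-up price or feature).
    Illustrative only; a real grader needs far more robust matching."""
    supported = any(claim.lower() in snippet.lower() for snippet in retrieved_snippets)
    return 1 if supported else -1

snippets = ["The Acme Book 14 is listed at CAD 1,149 on the retailer's site."]
print(score_grounded_claim("CAD 1,149", snippets))  # 1  (supported by evidence)
print(score_grounded_claim("CAD 899", snippets))    # -1 (hallucinated price)
```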

This is where models falter. Not because they cannot search, but because they prefer “coherent” answers over correct ones.

3. Hurdle-first grading

A task is an automatic 0 if the model fails a “hurdle” — the essential requirement. For example:

  • If a DIY task asks how to fix a leak and the model advises “hire a professional,” that fails.
  • If a Shopping task requires two compatible items and only one is provided, that fails.

This encourages adherence to consumer intent, not semantic fluency.
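A minimal sketch of hurdle-first aggregation as described above: any failed hurdle zeroes the task, and only then do the remaining criterion scores count. Averaging the non-hurdle scores is an assumption made for illustration; the report's exact aggregation formula is not reproduced here.

```python
def score_task(criterion_results: list[tuple[bool, bool, float]]) -> float:
    """Hurdle-first grading (illustrative).

    Each tuple is (is_hurdle, passed, score). A single failed hurdle makes
    the task an automatic 0; otherwise the criterion scores are averaged
    (the averaging step is an assumption, not ACE's published formula)."""
    for is_hurdle, passed, _ in criterion_results:
        if is_hurdle and not passed:
            return 0.0
    scores = [score for _, _, score in criterion_results]
    return sum(scores) / len(scores) if scores else 0.0

# "Hire a professional" on a fix-the-leak task fails the hurdle -> automatic 0.
print(score_task([(True, False, 0.0), (False, True, 1.0)]))  # 0.0
print(score_task([(True, True, 1.0), (False, True, 1.0)]))   # 1.0
```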

4. Multi-run evaluation

Each task is run eight times per model; the mean score becomes the published result. Run-to-run spread is substantial (≈16% standard deviation), underscoring the instability of consumer-grade AI behavior.
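The multi-run protocol itself is easy to sketch: score each task several times and publish the mean while tracking the spread. The eight-run count comes from the report; the toy score distribution below is only there to make the snippet runnable.

```python
import random
import statistics

def evaluate_task(run_task, n_runs: int = 8) -> tuple[float, float]:
    """Run a task n_runs times (ACE uses eight) and report the mean and the
    run-to-run standard deviation of the scores."""
    scores = [run_task() for _ in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)

# Toy stand-in for a real model run: scores scattered around 0.56 with
# roughly the ~16% spread described in the report.
mean, spread = evaluate_task(lambda: random.gauss(0.56, 0.16))
print(f"published score: {mean:.3f} (run-to-run std dev: {spread:.3f})")
```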


Findings — A benchmark that exposes the gap

ACE reveals the difference between appearing helpful and being helpful. Below are the essential insights.

1. Even top models barely exceed 56% accuracy

Overall ACE scores:

  • GPT‑5 (High Thinking): 56.1%
  • o3 Pro (On): 55.2%
  • GPT‑5.1 (High Thinking): 55.1%
  • Gemini / Anthropic models: 18–45%

Consumer-grade reliability is not yet enterprise-grade reliability.

2. Performance varies sharply across domains

Best score by domain:

  • Food: 70.1% (GPT‑5)
  • Gaming: 61.3% (o3 Pro)
  • DIY: 55.8% (GPT‑5.1)
  • Shopping: 45.4% (o3 Pro)

Shopping — the domain where accuracy matters most for real spending — is the worst-performing.

3. Models struggle most with grounded, factual criteria

A striking result appears in Figure 5 (p.6):

  • Gemini models show large negative drops in grounded criteria performance.
  • Some Anthropic models are more grounded but less capable on prompt requirements.

Models are great at sounding correct. Much less great at being correct.

4. Pricing and link accuracy are especially weak

In Shopping:

  • Price hallucinations frequently yield −1 scores.
  • Link accuracy is so poor that most models show negative or near-zero performance.

Frontier models still treat pricing as a creative writing exercise.

5. Nuanced human expectations remain hard

Criteria like safety warnings, compatibility matching, or strategic explanations score low across the board. These require contextual inference — something models simulate well but perform inconsistently.


Visual Summary — Where models succeed and fail

Below is a simplified recreation of ACE’s criteria‑type pattern.

Model Strengths

High-scoring criteria, and why models succeed:

  • Step-by-step instructions: pattern-based, deterministic tasks
  • Quantity requirements: easy numeric checks
  • Ingredient lists: structured data generation

Model Weaknesses

Low-scoring criteria, and why models fail:

  • Pricing accuracy: web-grounding errors, hallucination tendencies
  • Link validity: unreliable parsing and retrieval
  • Safety warnings: missed subtle contextual expectations
  • Compatibility constraints: requires multi-source reasoning

Implications — What this means for AI deployment

ACE reveals a structural challenge: web-enabled models are not yet reliable enough for autonomous consumer-facing workflows.

This has consequences for:

1. Product builders

Anyone building AI shopping assistants or consumer-repair copilots must assume:

  • grounding will fail frequently,
  • negative scoring scenarios (hallucinations) are common,
  • multi-run averaging is not available in real-time applications.
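In practice, that pushes builders toward defensive patterns: verify grounded facts before surfacing them and degrade gracefully when verification fails. The hypothetical guard below illustrates the idea; the function name, matching logic, and fallback copy are assumptions, not any particular product's API.

```python
def present_price(claimed_price: str, evidence_snippets: list[str]) -> str:
    """Only surface a model-claimed price if it can be matched against
    retrieved evidence; otherwise fall back rather than risk showing a
    hallucinated number. Real verification also needs normalization,
    currency handling, and freshness checks."""
    if any(claimed_price in snippet for snippet in evidence_snippets):
        return f"Price: {claimed_price} (verified against a current source)"
    return "Price unavailable: could not be verified against a current source."

print(present_price("CAD 1,149", ["Listed today at CAD 1,149."]))
print(present_price("CAD 899", ["Listed today at CAD 1,149."]))
```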

2. Regulators and assurance teams

ACE provides an early template for evaluating AI systems where factuality and traceability matter. Expect future compliance regimes to require:

  • independent grounding verification,
  • reproducible evaluation harnesses,
  • transparent multi-turn elicitation.
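A reproducible harness largely comes down to recording enough metadata to replay a run. The hypothetical configuration below sketches the kind of fields such a regime might require; none of these names come from ACE itself.

```python
# Hypothetical evaluation-run metadata for reproducibility and auditability.
# Field names and values are illustrative assumptions, not ACE's format.
eval_run_config = {
    "model": "example-model-2025-11",         # exact model version under test
    "n_runs_per_task": 8,                     # multi-run averaging, as in ACE
    "sampling": {"temperature": 0.7, "seed": 1234},
    "web_search_enabled": True,
    "retrieval_snapshot_date": "2025-11-15",  # lets graders re-verify grounding later
    "grader_rubric_version": "v1.0",
}
```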

3. Businesses deploying AI interfaces

AI models can:

  • inspire,
  • simplify,
  • accelerate workflows.

But they cannot yet replace human-level consumer judgment, especially for tasks involving purchases, safety, or complex constraints. Reliability remains probabilistic, not guaranteed.

4. Future research directions

ACE hints at several frontier areas:

  • stronger retrieval-augmented generation (RAG) pipelines,
  • hallucination‑penalizing architectures,
  • dynamic grounding enforcement,
  • contextual preference learning.

And perhaps most critically: benchmarks that measure what users actually need, not what labs find convenient.


Conclusion — The consumer AI gap

ACE forces the industry to confront an uncomfortable truth: we are building systems that sound like experts but still behave like enthusiastic interns. Smart, eager, occasionally brilliant — and sometimes confidently wrong.

Closing this gap is not optional. It is the prerequisite for trustworthy consumer AI.

Cognaptus: Automate the Present, Incubate the Future.