Opening — Why this matters now
Consumer AI has slipped into daily life with disarming ease. Grocery lists, game advice, budget meal plans, last‑minute gift triage — all comfortably outsourced to models that sound helpful, certain, and occasionally omniscient. But certainty is not accuracy, and confidence is not competence.
The AI Consumer Index (ACE) — introduced by Mercor Intelligence — provides the first rigorous attempt to measure whether frontier AI models actually deliver value in high-frequency, high-stakes consumer contexts. And the results? Let’s say they are… humbling.
ACE’s core finding is simple: even our best models still hallucinate prices, fabricate product details, and struggle with tasks that require grounded, web‑sourced evidence. The problem isn’t intelligence. It’s trustworthiness.
Background — From abstract benchmarks to practical competence
The AI ecosystem has long indulged in abstract benchmarks: reasoning puzzles, coding tasks, and professional exams. Useful? Sure. But these are not the tasks that dominate real user behavior.
In practice, consumers want:
- meal plans accommodating real constraints,
- shopping recommendations with actual current pricing,
- repair instructions that don’t burn down the house,
- gaming advice grounded in real item stats.
Prior benchmarks rarely measure these consumer workflows because they are messy, multi-step, and require models to integrate personal context with up‑to‑date web information. ACE fills this gap by offering:
- 400 hidden, expertly designed tasks across shopping, gaming, food, and DIY,
- fine-grained rubrics with grounding checks,
- hurdle criteria ensuring a model can’t “pass” by giving generalities,
- web-search-enabled model runs checked for factual consistency.
This is a shift from *Can a model reason?* to *Can a model help?*
Analysis — What ACE actually measures
ACE evaluates frontier models via four principles:
1. Consumer realism
Each task includes a persona (e.g., a Halifax concierge needing laptop advice for a brother) paired with a specific request. This ensures the model cannot rely on generic answers — it must tune recommendations to context.
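To make that concrete, here is a minimal sketch (in Python) of how a persona-conditioned task might be represented. The field names and example values are illustrative assumptions, not ACE's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ConsumerTask:
    domain: str                       # e.g. "Shopping", "Gaming", "Food", "DIY"
    persona: str                      # who is asking, and in what situation
    request: str                      # the concrete ask the model must satisfy
    criteria: list[str] = field(default_factory=list)   # rubric items
    hurdles: list[str] = field(default_factory=list)    # must-pass requirements

task = ConsumerTask(
    domain="Shopping",
    persona="A concierge in Halifax buying a laptop for their brother",
    request="Recommend a laptop with current pricing and a valid purchase link",
    criteria=["price is grounded in a retrieved page", "link resolves to the product"],
    hurdles=["a specific purchasable product is named"],
)
```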
2. Grounded truthfulness
For Shopping and Gaming, up to 74% of criteria require grounding — explicit factual support from retrieved web content. If the model makes up a price or feature, it receives a negative score.
This is where models falter. Not because they cannot search, but because they prefer “coherent” answers over correct ones.
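A hedged sketch of what grounded scoring implies in practice: a specific claim earns credit only if retrieved evidence backs it, and an unsupported claim is penalized rather than ignored. The +1/−1 values and the naive substring `supports` check below are assumptions for illustration; ACE relies on expert rubrics, not string matching.

```python
def supports(doc: str, claim: str) -> bool:
    """Placeholder check; a real harness would use an expert grader or LLM judge."""
    return claim.lower() in doc.lower()

def score_grounded_criterion(claim: str, evidence: list[str]) -> int:
    """+1 if the claim is backed by retrieved content, -1 if nothing supports it."""
    return 1 if any(supports(doc, claim) for doc in evidence) else -1

page = "Product page: Acme UltraBook 14, price $1,199"
print(score_grounded_criterion("$1,199", [page]))   # 1: price appears in the source
print(score_grounded_criterion("$1,299", [page]))   # -1: fabricated price is penalized
```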
3. Hurdle-first grading
A task is an automatic 0 if the model fails a “hurdle” — the essential requirement. For example:
- If a DIY task asks how to fix a leak and the model advises “hire a professional,” that fails.
- If a Shopping task requires two compatible items and only one is provided, that fails.
This encourages adherence to consumer intent, not semantic fluency.
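Expressed as code, the gating logic might look like the sketch below: a minimal illustration assuming boolean hurdle checks and per-criterion scores, not ACE's actual data format.

```python
def grade_task(hurdles_passed: list[bool], criterion_scores: list[float]) -> float:
    """Hurdle-first grading: any missed hurdle zeroes the task outright."""
    if not all(hurdles_passed):
        return 0.0                                  # essential requirement missed
    return sum(criterion_scores) / len(criterion_scores)

# "Hire a professional" on a fix-the-leak task misses the hurdle entirely:
print(grade_task([False], [1, 1, 1]))               # 0.0
# A response that clears its hurdles is then graded on the rubric criteria:
print(grade_task([True], [1, 1, 0]))                # ~0.67
```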
4. Multi-run evaluation
Each task is run eight times per model, and the mean score becomes the published result. Run-to-run variation is substantial (a standard deviation of roughly 16%), underscoring the instability of consumer-grade AI behavior.
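In harness terms, the aggregation step is straightforward; the per-run scores below are hypothetical and serve only to illustrate the mean-plus-spread reporting.

```python
from statistics import mean, stdev

def aggregate_runs(run_scores: list[float]) -> tuple[float, float]:
    """Return the published score (mean) and the run-to-run standard deviation."""
    return mean(run_scores), stdev(run_scores)

# Eight hypothetical runs of one task by one model:
runs = [0.62, 0.41, 0.55, 0.70, 0.38, 0.58, 0.47, 0.66]
avg, spread = aggregate_runs(runs)
print(f"published score = {avg:.2f}, run-to-run stdev = {spread:.2f}")
```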
Findings — A benchmark that exposes the gap
ACE reveals the difference between appearing helpful and being helpful. Below are the essential insights.
1. Even top models barely exceed 56% accuracy
| Model | ACE Overall Score |
|---|---|
| GPT‑5 (High Thinking) | 56.1% |
| o3 Pro (On) | 55.2% |
| GPT‑5.1 (High Thinking) | 55.1% |
| Gemini / Anthropic models | 18–45% |
Consumer-grade reliability is not yet enterprise-grade reliability.
2. Performance varies sharply across domains
| Domain | Best Model Score |
|---|---|
| Food | 70.1% (GPT‑5) |
| Gaming | 61.3% (o3 Pro) |
| DIY | 55.8% (GPT‑5.1) |
| Shopping | 45.4% (o3 Pro) |
Shopping — the domain where accuracy matters most for real spending — is the worst-performing.
3. Models struggle most with grounded, factual criteria
A striking result appears in Figure 5 (p.6):
- Gemini models show large negative drops in grounded criteria performance.
- Some Anthropic models are more grounded but less capable on prompt requirements.
Models are great at sounding correct. Much less great at being correct.
4. Links and prices are a disaster zone
In Shopping:
- Price hallucinations frequently yield −1 scores.
- Link accuracy is so poor that most models show negative or near-zero performance.
Frontier models still treat pricing as a creative writing exercise.
5. Nuanced human expectations remain hard
Criteria like safety warnings, compatibility matching, or strategic explanations score low across the board. These require contextual inference — something models simulate well but perform inconsistently.
Visual Summary — Where models succeed and fail
Below is a simplified recreation of ACE's criterion-type patterns.
Model Strengths
| High-Scoring Tasks | Why Models Succeed |
|---|---|
| Step-by-step instructions | Pattern-based, deterministic tasks |
| Quantity requirements | Easy numeric checks |
| Ingredient lists | Structured data generation |
Model Weaknesses
| Low-Scoring Tasks | Why Models Fail |
|---|---|
| Pricing accuracy | Web-grounding errors, hallucination tendencies |
| Link validity | Unreliable parsing and retrieval |
| Safety warnings | Missing subtle contextual expectations |
| Compatibility constraints | Requires multi-source reasoning |
Implications — What this means for AI deployment
ACE reveals a structural challenge: web-enabled models are not yet reliable enough for autonomous consumer-facing workflows.
This has consequences for:
1. Product builders
Anyone building AI shopping assistants or consumer-repair copilots must assume:
- grounding will fail frequently (a minimal guardrail sketch follows this list),
- negative scoring scenarios (hallucinations) are common,
- multi-run averaging is not available in real-time applications.
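One practical consequence: re-verify every concrete claim before it reaches the user. The sketch below is a minimal guardrail under those assumptions, with hypothetical function names and naive substring matching for prices, not a production design.

```python
import urllib.request

def link_resolves(url: str, timeout: float = 5.0) -> bool:
    """Best-effort check that a recommended link actually loads."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except Exception:
        return False

def price_is_grounded(quoted_price: str, page_text: str) -> bool:
    """Only trust a quoted price if it literally appears on the fetched page."""
    return quoted_price.replace(" ", "") in page_text.replace(" ", "")

def safe_to_show(url: str, quoted_price: str, page_text: str) -> bool:
    """Gate a recommendation on both a live link and a grounded price."""
    return link_resolves(url) and price_is_grounded(quoted_price, page_text)
```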
2. Regulators and assurance teams
ACE provides an early template for evaluating AI systems where factuality and traceability matter. Expect future compliance regimes to require:
- independent grounding verification,
- reproducible evaluation harnesses,
- transparent multi-turn elicitation.
3. Businesses deploying AI interfaces
AI models can:
- inspire,
- simplify,
- accelerate workflows.
But they cannot yet replace human-level consumer judgment, especially for tasks involving purchases, safety, or complex constraints. Reliability remains probabilistic, not guaranteed.
4. Future research directions
ACE hints at several frontier areas:
- stronger retrieval-augmented generation (RAG) pipelines,
- hallucination‑penalizing architectures,
- dynamic grounding enforcement,
- contextual preference learning.
And perhaps most critically: benchmarks that measure what users actually need, not what labs find convenient.
Conclusion — The consumer AI gap
ACE forces the industry to confront an uncomfortable truth: we are building systems that sound like experts but still behave like enthusiastic interns. Smart, eager, occasionally brilliant — and sometimes confidently wrong.
Closing this gap is not optional. It is the prerequisite for trustworthy consumer AI.
Cognaptus: Automate the Present, Incubate the Future.