Shopping is where AI confidence goes to embarrass itself.

Ask a frontier model for a gift, a replacement part, a budget-friendly product, or a game recommendation, and the answer often looks excellent. It is neatly formatted. It gives reasons. It may even include links and prices, because apparently nothing says “trust me” like a fabricated discount on a product page that no longer exists.

That is the uncomfortable gap measured by the AI Consumer Index, or ACE, introduced by Mercor Intelligence as a benchmark for everyday consumer AI tasks.[1] ACE is not asking whether a model can solve a graduate exam, write Python, or perform theatrical reasoning in a sandbox. It asks whether web-enabled frontier models can satisfy ordinary consumer needs across Shopping, Food, Gaming, and DIY.

The result is not a catastrophic failure story. That would be too easy, and frankly too melodramatic. The more useful story is sharper: the best model reaches 56.1% overall on ACE-v1-heldout, while the strongest Shopping score is still only 45.4%. The gap is not mainly about eloquence. It is about whether the answer meets the user’s core goal, whether the details are grounded, and whether the model is punished for inventing facts while trying to be helpful.

ACE matters because it changes the unit of evaluation. It is not asking, “Did the model produce a plausible response?” It is asking, “Would a consumer be safe, satisfied, and correctly informed if they acted on this?”

That is a much nastier question. Conveniently, it is also the right one.

ACE is really a test of consumer usefulness, not model charm

Most AI benchmarks reward performance on tasks that are easier to freeze: math, coding, exams, academic reasoning, or professional workflows. Those are valuable, but they do not fully describe the way many people actually use consumer AI.

Consumer tasks are messy in a different way. A meal plan has dietary constraints, portion expectations, preparation limits, and personal preferences. A DIY answer may need both step-by-step instructions and a warning that the user should not improvise with electricity, mold, gas, load-bearing walls, or whatever other small household adventure is trying to become a lawsuit. A gaming recommendation may need compatibility with a platform, a genre preference, a group size, and a price. Shopping requires the model to interact with the live web, where prices move, links rot, vendors vary, and products vanish.

ACE builds around that mess rather than pretending it does not exist.

The benchmark includes a hidden heldout set of 400 tasks, evenly split across four domains: 100 each for DIY, Food, Gaming, and Shopping. It also releases an open development set of 80 cases, with 20 per domain, under a CC-BY license. The hidden set is used for leaderboard evaluation; the dev set supports transparency and reproducibility without handing the full test to model builders on a silver platter.

The cases were created by 47 subject matter experts. That detail matters. ACE is not simply a pile of synthetic prompts dressed in consumer clothing. The paper says experts included personal shoppers, stylists, shopping editors, game developers, professional gamers, chefs, food editors, nutritionists, tradespeople, construction workers, and mechanical engineers. The point is not that expert judgment is perfect. The point is that consumer usefulness often depends on domain-specific expectations that a generic prompt writer may not know to include.

The benchmark also uses workflow taxonomies. Shopping includes bargain hunting, compatibility, gifting, profile-based recommendations, and vendor recommendations. Food includes meal planning, potluck recommendations, and limited-resource recipe tasks. Gaming includes game design, inspiration, selection, and tactics. DIY includes repairs and crafts.

That taxonomy makes ACE more than a score sheet. It is a map of where consumer AI breaks under different types of friction.

The first mechanism: hurdle criteria stop models from winning by being adjacent

The first clever design choice in ACE is the hurdle.

Each task has criteria, but not all criteria are equally important. A response should not receive meaningful credit for satisfying minor formatting or detail requirements if it fails the user’s central request. ACE therefore marks some criteria as “hurdles.” These capture the core goal of the prompt.

This is a small design decision with large consequences.

Without hurdles, a model can game consumer tasks by being broadly relevant. It might fail to recommend the requested compatible product but still mention a price, include a link, provide a friendly explanation, and appear useful enough to collect partial credit. That is fine if the benchmark measures answer-shaped text. It is not fine if the benchmark measures whether the consumer’s problem was solved.

ACE uses hurdle criteria to gate further reward. If the response fails the core objective, the task score collapses rather than allowing peripheral niceness to compensate.
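The gating idea can be sketched in a few lines. This is a minimal illustration, not the paper's scoring code: the `Criterion` class, the `task_score` function, the equal weighting of criteria, and the choice to collapse the score to exactly zero on a failed hurdle are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    passed: bool
    is_hurdle: bool = False  # True for criteria that capture the core goal


def task_score(criteria: list[Criterion]) -> float:
    """Share of criteria passed, gated on every hurdle passing."""
    if not criteria:
        return 0.0
    # Assumption: a single failed hurdle collapses the task score to zero.
    if any(c.is_hurdle and not c.passed for c in criteria):
        return 0.0
    return sum(c.passed for c in criteria) / len(criteria)


# An "adjacent" answer: friendly extras, but the core request is missed.
adjacent = [
    Criterion("recommends a compatible part", passed=False, is_hurdle=True),
    Criterion("gives a price", passed=True),
    Criterion("includes a link", passed=True),
]

# A solved task that misses one peripheral detail.
solved = [
    Criterion("recommends a compatible part", passed=True, is_hurdle=True),
    Criterion("gives a price", passed=True),
    Criterion("includes a link", passed=False),
]

print(task_score(adjacent))  # 0.0
print(task_score(solved))    # two of three criteria passed
```

Without the gate, the adjacent answer would have collected two thirds of the credit for a problem it never solved.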

The paper reports that ACE-v1-heldout averages 1.32 hurdles per case. Food has the highest average, at 1.67 hurdles per case, while DIY has 1.01. That distribution makes sense. Food tasks often combine constraints — dietary, quantity, dish features, preparation limits — where missing one essential condition can make the answer unusable. DIY prompts, by contrast, often revolve around a more singular procedural objective.

The business interpretation is direct: consumer AI evaluation needs acceptance tests before quality scoring.

A chatbot that gives a beautiful answer to the wrong shopping problem is not “partially successful.” It is a polished failure. Very elegant. Still wrong.

The second mechanism: grounding turns hallucination into a penalty, not a neutral miss

The second design choice is more important: ACE treats unsupported factual claims as worse than silence.

For DIY and Food, once a hurdle is passed, the model is scored against rubric criteria with ordinary pass/fail grading. For Gaming and Shopping, ACE adds grounding checks for criteria involving empirical web claims. In ACE-v1-heldout, 42% of Gaming criteria and 74% of Shopping criteria require grounding. Food and DIY do not use grounding checks in this first version.

The grading logic is hierarchical:

| Step | What ACE checks | Why it matters |
|---|---|---|
| 1 | Does the response meet the criterion? | Prevents credit for irrelevant or incomplete answers. |
| 2 | Does the criterion require grounding? | Separates ordinary content quality from factual web claims. |
| 3 | Is the claim supported by retrieved sources? | Penalizes fabricated links, prices, product facts, and other unsupported details. |

The key move is the score assigned to ungrounded claims. If a response does not meet a criterion, it receives zero for that criterion. But if it appears to meet a grounding-required criterion while the claim is unsupported, it can receive a negative score.
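The three-step logic maps naturally onto a small function. The sketch below is illustrative only: `criterion_score` is a hypothetical name, and the `+1` for a pass and `-1` for an ungrounded claim are assumed values. The text above says only that an ungrounded claim can receive a negative score.

```python
def criterion_score(met: bool, needs_grounding: bool, grounded: bool,
                    ungrounded_penalty: float = -1.0) -> float:
    """Hierarchical grading for a single criterion."""
    if not met:
        # Step 1: the criterion is not met, so it earns nothing.
        return 0.0
    if not needs_grounding:
        # Step 2: no empirical web claim to verify, ordinary pass.
        return 1.0
    # Step 3: the claim must be supported by retrieved sources.
    # Assumption: the penalty is -1; the exact value is not stated here.
    return 1.0 if grounded else ungrounded_penalty


print(criterion_score(met=False, needs_grounding=True, grounded=False))  # 0.0
print(criterion_score(met=True, needs_grounding=False, grounded=False))  # 1.0
print(criterion_score(met=True, needs_grounding=True, grounded=True))    # 1.0
print(criterion_score(met=True, needs_grounding=True, grounded=False))   # -1.0
```

The ordering matters: a model that says nothing about price lands on zero, while a model that invents a price lands below it.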

That is the mechanism that makes ACE different from many softer evaluations. The benchmark is not merely asking whether the model included a price. It asks whether the price is supported by the retrieved source. If not, the model is not “almost right.” It is actively misleading.

This is especially important for consumer AI because models often fail in a commercially dangerous way. They do not always refuse or admit uncertainty. They produce an answer that looks actionable. The model may recommend a product, provide a link, cite a price, and explain why it fits the user’s needs. If those details are fabricated or stale, the user experiences the failure only after clicking, buying, assembling, cooking, or troubleshooting.

That is why ACE’s negative scoring is not a technical curiosity. It is an operational philosophy: hallucinated helpfulness should cost more than non-helpfulness.

The third mechanism: ACE prevents recommendation spam

A weaker consumer benchmark can be exploited by quantity. If the user asks for a product recommendation and the model returns ten items, perhaps one of them happens to satisfy the criterion. The response then looks successful even though most of the list is noise.

ACE tries to avoid this.

In its grounding implementation, if multiple products are returned, all of the products must meet the relevant requirements for a criterion. If one product’s pricing claim is ungrounded, the response fails that grounding check for that criterion. This prevents models from spraying recommendations and hoping that something sticks.
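As a rough sketch, the all-items rule is a single `all()` over the returned products. The `Product` class and its `price_claim_grounded` field are hypothetical, and treating an empty list as a failure is an assumption of this example rather than something the paper specifies.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price_claim_grounded: bool  # verdict from an upstream grounding check


def passes_price_grounding(products: list[Product]) -> bool:
    """Every recommended product must have a grounded price claim."""
    # Assumption: an empty recommendation list cannot pass.
    return bool(products) and all(p.price_claim_grounded for p in products)


focused = [Product("kettle A", True)]
sprayed = [Product(f"kettle {i}", True) for i in range(9)]
sprayed.append(Product("kettle 9", False))  # one fabricated price

print(passes_price_grounding(focused))  # True
print(passes_price_grounding(sprayed))  # False
```

Under this rule a longer list is a liability rather than a hedge: nine grounded prices do not rescue the tenth.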

For product builders, this is one of the most transferable ideas in the paper.

A real shopping assistant should not optimize for “some acceptable answer exists somewhere in this long list.” It should optimize for every recommended item being defensible. In many consumer workflows, the cost of one bad recommendation is not averaged away by nine decent ones. One incompatible replacement part ruins the task. One unsafe DIY omission can be worse than no answer. One fake link destroys trust faster than three correct paragraphs restore it.

The model’s surface area matters. More claims mean more opportunities to be wrong.

The leaderboard is evidence, but not the main lesson

ACE evaluates 10 frontier models from OpenAI, Google DeepMind, and Anthropic, all with web search enabled. Responses were collected at the end of November 2025. The setup uses eight runs per model-task pair, yielding 32,000 model responses across 400 heldout cases, 10 models, and eight repetitions. Because each task has multiple criteria, the paper reports more than 220,000 individual gradings.
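The arithmetic is easy to check. The snippet below only reproduces the counts quoted above; the average criteria per response is a derived lower bound, assuming every response is graded against every criterion of its task.

```python
heldout_cases = 400
models = 10
runs_per_pair = 8

responses = heldout_cases * models * runs_per_pair
print(responses)  # 32000

# "More than 220,000 gradings" implies a floor on criteria per response.
reported_gradings_floor = 220_000
print(reported_gradings_floor / responses)  # 6.875
```

So each response is judged against roughly seven or more criteria on average, which is why a single fluent paragraph is not enough to score well.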

The main leaderboard result is straightforward:

| Model | Provider | Overall ACE-v1-heldout score |
|---|---|---|
| GPT 5 (High) | OpenAI | 56.1% |
| o3 Pro (On) | OpenAI | 55.2% |
| GPT 5.1 (High) | OpenAI | 55.1% |
| o3 (On) | OpenAI | 52.9% |
| Gemini 3 Pro (High) | Google | 45.7% |
| Opus 4.5 (On) | Anthropic | 38.3% |
| Gemini 2.5 Flash (On) | Google | 35.7% |
| Sonnet 4.5 (On) | Anthropic | 35.5% |
| Opus 4.1 (On) | Anthropic | 33.8% |
| Gemini 2.5 Pro (On) | Google | 31.9% |

It would be tempting to turn this into a horse race. GPT 5 wins overall. o3 Pro and GPT 5.1 are close. Gemini 3 Pro leads the non-OpenAI group. Shopping remains brutal. Roll credits.

That would miss the paper’s more useful point.

The leaderboard is the main evidence, but the mechanism explains why the evidence matters. ACE shows that web-enabled frontier models still struggle when correctness depends on satisfying the user’s actual intent and grounding empirical claims. The best model does not fail because it cannot write fluent recommendations. It fails because real consumer assistance requires a chain of small verifications, and each verification is an opportunity for the model to drift from reality.

The domain split makes this clearer:

| Domain | Best reported score | Best model | Interpretation |
|---|---|---|---|
| Food | 70.1% | GPT 5 | Structured constraints are hard, but still more tractable than live-web shopping. |
| Gaming | 61.3% | o3 Pro | Strategy and recommendation tasks are moderately tractable, but compatibility and links remain difficult. |
| DIY | 55.8% | GPT 5.1 | Procedural guidance is not enough; safety, materials, and judgment remain uneven. |
| Shopping | 45.4% | o3 Pro | The live-web, price, link, vendor, and compatibility burden is still poorly handled. |

Shopping is the important failure case because it looks like the obvious consumer monetization path. AI shopping assistants are easy to imagine, easy to demo, and easy to make superficially impressive. ACE suggests they are also easy to overtrust.

That is the sort of result product teams should read twice before adding an affiliate link and calling it strategy.

The failure pattern: models are better at structure than verification

ACE’s criteria-type results are more informative than the overall leaderboard because they show where the competence actually sits.

Models perform relatively well on simple, structured criteria: step-by-step instructions, quantity requirements, ingredient lists, and set-list style recommendations. For example, in DIY, “provides step-by-step instructions” is very strong across models, with top scores reaching the high nineties. In Food, many models do reasonably well on dish features, quantity/duration requirements, and set-list recommendations.

But the scores weaken when the task requires judgment, compatibility, safety, or live factual support.

| Criteria pattern | What the paper shows | Practical reading |
|---|---|---|
| Step-by-step instructions | Strong scores across models in DIY | Models are good at procedural formatting and generic instruction. |
| Quantity requirements | Strong scores in Food, Gaming, and Shopping | Explicit numeric constraints are comparatively manageable. |
| Safety warnings | Lower scores in DIY than procedural steps | Models may explain how without reliably knowing when not to. |
| Compatibility requirements | Weak in Gaming and central to Shopping workflows | Matching constraints across systems/products remains hard. |
| Prices and product/vendor features | Very weak in Shopping, sometimes negative | Web-grounded commerce facts are a major failure zone. |
| Links | Poor in Shopping; mixed in Gaming | A link in the answer is not the same thing as a verified source. |

The most revealing Shopping rows are ugly in exactly the way a business should care about. For “meets pricing requirements/gives price,” several models score near zero or negative, while even stronger models remain low. For “provides link(s),” some scores are negative or barely positive. Gemini 3 Pro’s Shopping link score is reported as -54, while GPT 5 reaches only 4, GPT 5.1 reaches 15, and o3 Pro reaches 1.

That does not mean these models cannot search the web. It means that search-enabled response generation is not the same as source-grounded transaction support.

This distinction is easy to ignore in demos. A model can cite a source, mention a vendor, produce a price, and sound competent. But if the verification layer is weak, the user sees a recommendation while the system has actually produced an attractive liability object.

A very modern artifact.

Grounding reveals a second axis of model behavior

ACE’s grounding analysis separates two abilities that are often blended together: satisfying the surface requirement and supporting the factual claim.

This matters because some models are relatively good at producing answer-shaped outputs but weaker at grounding them. The paper reports that Gemini 3 Pro is the least grounded among the tested models, passing 38.0% of grounding tests, while GPT 5.1 is the most grounded, passing 70.8%.

The paper also compares each model’s pass rate on all criteria with its pass rate on grounding criteria. Some models drop sharply on grounding, suggesting they are better at creating responses that look compliant than at anchoring claims to retrieved sources. Others do relatively better on grounding than on general prompt satisfaction, suggesting a different tradeoff: more cautious or more source-aligned, but less broadly successful at meeting the full task.
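A team can run the same comparison on its own grading logs. The sketch below uses made-up data: the `Grading` record, the model names, and every value in `log` are hypothetical, and none of them are figures from the paper.

```python
from dataclasses import dataclass


@dataclass
class Grading:
    model: str
    passed: bool
    needs_grounding: bool


def pass_rate(gradings: list[Grading]) -> float:
    return sum(g.passed for g in gradings) / len(gradings) if gradings else 0.0


def grounding_gap(gradings: list[Grading], model: str) -> float:
    """Grounding-criteria pass rate minus all-criteria pass rate."""
    mine = [g for g in gradings if g.model == model]
    grounding_only = [g for g in mine if g.needs_grounding]
    return pass_rate(grounding_only) - pass_rate(mine)


# Hypothetical log: model_a looks compliant but fails grounded criteria,
# model_b is weaker on ordinary criteria but anchors its claims.
log = [
    Grading("model_a", True, False), Grading("model_a", True, False),
    Grading("model_a", False, True), Grading("model_a", False, True),
    Grading("model_b", False, False), Grading("model_b", False, False),
    Grading("model_b", True, True), Grading("model_b", True, True),
]

print(grounding_gap(log, "model_a"))  # -0.5
print(grounding_gap(log, "model_b"))  # 0.5
```

Both toy models pass half of all criteria, so a single headline number would rate them the same. The gap is what separates them.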

This is useful because “best model” is not a single procurement property.

A company building an AI meal-planning assistant may care about constraint satisfaction and user preference elicitation. A shopping assistant needs source-grounded prices, product features, and purchase links. A DIY assistant needs procedural clarity, safety triage, and professional escalation. A game recommendation system needs compatibility checks and current platform availability.

ACE implies that consumer AI systems should be selected and configured by failure mode, not just by leaderboard rank.

The dev-set comparison is a robustness check, not a second headline

ACE includes an open dev set of 80 cases. The authors evaluate the same 10 models on this dev set using the same methodology. All models score higher on ACE-v1-dev than on ACE-v1-heldout, but the differences are less than five percentage points, and no model moves more than two rank positions.
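The two checks in that paragraph, score difference and rank movement, are simple to reproduce on any pair of evaluation sets. The scores below are invented for illustration and the model names are placeholders; only the shape of the comparison follows the text.

```python
def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def compare_sets(heldout: dict[str, float], dev: dict[str, float]) -> dict:
    held_rank, dev_rank = ranks(heldout), ranks(dev)
    return {
        model: {
            "score_delta": round(dev[model] - heldout[model], 1),
            "rank_shift": abs(held_rank[model] - dev_rank[model]),
        }
        for model in heldout
    }


# Hypothetical scores in percentage points, not values from the paper.
heldout = {"model_a": 56.0, "model_b": 55.0, "model_c": 45.0}
dev = {"model_a": 58.0, "model_b": 59.0, "model_c": 47.5}

report = compare_sets(heldout, dev)
print(report["model_b"])  # {'score_delta': 4.0, 'rank_shift': 1}
print(max(r["score_delta"] for r in report.values()) < 5)  # True
```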

This result is useful, but it should be interpreted carefully.

| Test or table | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| ACE-v1-heldout leaderboard | Main evidence | Frontier models remain far from reliable consumer task completion. | It does not prove one model will dominate every consumer product context. |
| Criteria-type table | Diagnostic analysis | Failures concentrate in grounding, links, pricing, compatibility, safety, and nuanced requirements. | It does not isolate the cause inside model weights, retrieval, tool-use, or grading. |
| Grounding pass-rate analysis | Mechanism evidence | Some models are more likely to produce superficially compliant but ungrounded answers. | It does not fully eliminate measurement error in web extraction or judge evaluation. |
| ACE-v1-dev comparison | Robustness / representativeness check | The open dev set is broadly similar in difficulty and ranking pattern to the hidden set. | It does not remove contamination risk once the dev set is public. |
| Bootstrapped confidence intervals | Statistical uncertainty estimate | Reported means have measurable uncertainty across sampled cases. | It does not address future drift as the internet changes. |
| Appendix workflow taxonomy | Implementation detail with interpretive value | The benchmark covers distinct consumer workflows, not just broad domains. | It does not claim full coverage of all consumer AI use cases. |

The caveat on the dev-set comparison matters. The dev-set comparison is not a new thesis about model progress. It is evidence that the public dev set is not wildly disconnected from the hidden heldout benchmark. Useful, but not magical. Public data can still become training data, and training data has a way of wandering into models wearing a fake mustache.
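For readers who want to see what a bootstrapped confidence interval over sampled cases involves, here is a minimal percentile-bootstrap sketch. The paper's exact procedure is not described here, so the percentile method, the resample count, the seed, and the per-case scores are all assumptions.

```python
import random
from statistics import mean


def bootstrap_ci(case_scores: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-case scores."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(case_scores)
    # Resample cases with replacement and record the mean each time.
    means = sorted(
        mean(rng.choices(case_scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-case scores for one model on a small evaluation set.
scores = [0.0, 0.25, 0.5, 0.5, 0.75, 1.0, 1.0, 0.4]
low, high = bootstrap_ci(scores)
print(f"mean={mean(scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

The interval describes sampling uncertainty over cases only. It says nothing about how the same cases would score after the web has changed.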

What ACE directly shows, and what businesses should infer

The safest business reading of ACE is not “choose the model with the highest score.” That is the obvious reading. Also the shallow one.

The deeper implication is that consumer AI needs workflow-level evaluation architecture.

| Paper result | What it directly shows | Cognaptus inference for business use |
|---|---|---|
| Top overall score is 56.1% | Even leading web-enabled models fail many consumer tasks under ACE scoring. | Autonomous consumer-facing deployment needs guardrails, not just better prompting. |
| Shopping tops out at 45.4% | Price, link, vendor, and compatibility tasks remain weak. | Commerce assistants need independent product-data verification before recommendations are shown. |
| Grounding failures can receive negative scores | Unsupported factual claims are treated as worse than missing claims. | Evaluation should penalize confident misinformation more heavily than uncertainty. |
| Hurdles gate task scores | Core user intent must be satisfied before details matter. | Product QA should define non-negotiable acceptance criteria for each workflow. |
| Dev set is slightly easier but broadly similar | The open cases are useful for development, but not identical to hidden evaluation. | Teams can use dev-style cases internally but must maintain private regression tests. |

For product teams, the practical architecture is not complicated in concept. It is just annoying enough that many teams will skip it until something breaks.

A serious consumer AI workflow should include:

  1. Intent hurdle checks. Did the system satisfy the user’s actual core request?
  2. Constraint checks. Did every recommendation meet stated preferences, budget, compatibility, dietary, safety, and availability constraints?
  3. Source grounding. Are prices, links, product features, vendor claims, and factual assertions supported by current sources?
  4. Escalation rules. Should the model refuse, defer, ask a follow-up, recommend a professional, or route to a human?
  5. Regression cases. Does the assistant keep passing realistic consumer tasks as models, prompts, tools, and the web change?
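One way to wire the five items together is an ordered gate plus a regression pass rate. This is a sketch under heavy assumptions: the `Draft` record stands in for verdicts that real validators would have to produce, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Draft:
    """Hypothetical validator verdicts for one candidate answer."""
    core_request_met: bool
    constraints_met: bool
    claims_grounded: bool
    needs_escalation: bool


# Ordered gates: the first failing check blocks the answer.
CHECKS: list[tuple[str, Callable[[Draft], bool]]] = [
    ("intent_hurdle", lambda d: d.core_request_met),
    ("constraints", lambda d: d.constraints_met),
    ("source_grounding", lambda d: d.claims_grounded),
    ("escalation", lambda d: not d.needs_escalation),
]


def first_failure(draft: Draft) -> Optional[str]:
    """Name of the first failed check, or None if the draft may ship."""
    for name, check in CHECKS:
        if not check(draft):
            return name
    return None


def regression_pass_rate(drafts: list[Draft]) -> float:
    """Share of regression cases that clear every gate."""
    if not drafts:
        return 0.0
    return sum(first_failure(d) is None for d in drafts) / len(drafts)


suite = [
    Draft(True, True, True, False),   # clean answer
    Draft(True, True, False, False),  # fabricated price or link
]
print(first_failure(suite[1]))      # source_grounding
print(regression_pass_rate(suite))  # 0.5
```

Rerunning the same suite after a model, prompt, or tool change is what turns the fifth item from a slogan into a number.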

This is not glamorous. It does not look as impressive in a launch video as a cheerful assistant producing 12 recommendations in two seconds. But it is where consumer trust is actually built.

The business value is not “AI can shop”; it is cheaper reliability diagnosis

ACE’s most immediate value for businesses is diagnostic.

It helps teams identify which layer of the system is failing. A bad answer might come from the model, the retrieval system, the prompt, the source extraction pipeline, the product database, the grading rule, or the lack of a follow-up question. ACE does not solve all of those problems, but its structure makes the failure easier to name.

That naming matters because different failures require different fixes.

If the model fails the hurdle, the issue may be task interpretation or instruction following. If it satisfies the criterion but fails grounding, the issue may be retrieval, source selection, link extraction, or unsupported synthesis. If it provides many recommendations and one is invalid, the system may need stricter list-level validation. If it gives a correct procedure but omits a safety warning, the product needs risk classification and escalation logic.
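That mapping from symptom to suspected layer can be written down as a small triage helper. The field names and the `suspected_layers` function are hypothetical, and the suggested layers simply restate the paragraph above. They are starting points for investigation, not diagnoses.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical summary of how one response was graded."""
    hurdle_passed: bool = True
    grounded: bool = True
    all_items_valid: bool = True
    safety_warning_present: bool = True


def suspected_layers(obs: Observation) -> list[str]:
    """Translate grading symptoms into places to look first."""
    suspects = []
    if not obs.hurdle_passed:
        suspects.append("task interpretation or instruction following")
    if not obs.grounded:
        suspects.append("retrieval, source selection, link extraction, "
                        "or unsupported synthesis")
    if not obs.all_items_valid:
        suspects.append("list-level validation")
    if not obs.safety_warning_present:
        suspects.append("risk classification and escalation logic")
    return suspects


print(suspected_layers(Observation(grounded=False, all_items_valid=False)))
```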

In other words, ACE points away from a single-model worldview.

A reliable consumer AI product will likely be a system: model plus tools, retrieval, databases, validators, policy rules, user-profile elicitation, UI constraints, and monitoring. The model is the most visible component, which is why everyone stares at it. The less visible components are where much of the reliability will be won.

How inconvenient for slide decks.

Where ACE should not be overread

ACE is valuable, but its boundaries are important.

First, ACE-v1 is text-only. The authors explicitly note plans to expand toward images, audio, and video. That matters because many consumer tasks are multimodal. A DIY repair assistant may need to inspect a photo. A shopping assistant may need to compare product images. A food assistant may need pantry recognition. A gaming assistant may depend on screenshots or gameplay clips.

Second, ACE covers four domains. They are high-value domains, but they are not all of consumer AI. The paper mentions future expansion into areas such as consumer finance and travel. Those domains would likely introduce different risk profiles, especially around regulation, liability, identity, and payment behavior.

Third, personas are explicit in the benchmark prompts. Real users often do not articulate their preferences so neatly. They may say “find me something good,” while silently expecting the assistant to infer budget, taste, context, risk tolerance, and hidden constraints. The paper recognizes this and suggests that multi-turn conversations may better reflect how preferences are naturally elicited.

Fourth, grounding checks can have measurement error. ACE must identify URLs, extract web content, identify claims, map claims to relevant links, and judge whether sources support the answer. Any one of those steps can fail. That does not invalidate the benchmark, but it means grounding scores should be treated as measurement with noise, not divine revelation delivered by a scraper.
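A toy version of that pipeline shows how many places it can go wrong. Everything here is a simplification: the regular expression is a rough URL matcher, the `pages` dictionary stands in for a real fetch-and-extract step, and naive substring matching replaces whatever judge ACE actually uses.

```python
import re

# Rough URL matcher; real responses need more careful parsing.
URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")


def extract_urls(response: str) -> list[str]:
    """Pull candidate source URLs out of a model response."""
    return [url.rstrip(".,;") for url in URL_PATTERN.findall(response)]


def is_grounded(claim: str, urls: list[str], pages: dict[str, str]) -> bool:
    """Assumption: a claim is grounded if some cited page contains it verbatim."""
    return any(claim in pages.get(url, "") for url in urls)


response = "The kettle costs $24.99 at https://example.com/kettle."
# Hypothetical extracted page text, keyed by URL.
pages = {"https://example.com/kettle": "Steel kettle. Price: $24.99"}

urls = extract_urls(response)
print(urls)                               # ['https://example.com/kettle']
print(is_grounded("$24.99", urls, pages))  # True
print(is_grounded("$19.99", urls, pages))  # False
print(is_grounded("$24.99", urls, {}))     # False: page missing or unreadable
```

The last line is the measurement-error case in miniature: a true claim is marked ungrounded because the page could not be retrieved, not because the model invented anything.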

Fifth, the internet changes. Shopping is not a static benchmark domain. Prices change, listings vanish, vendors alter pages, and social content mutates. ACE’s authors expect the benchmark to be refreshed and rerun regularly. That is not a side note. It is central to consumer AI evaluation. A shopping answer can expire faster than a carton of milk in Manila traffic.

The real lesson: helpfulness must be audited at the claim level

The popular fantasy of consumer AI is that better models will naturally become better assistants. ACE suggests a more disciplined version: better models help, but consumer reliability also requires better evaluation mechanics.

The benchmark’s mechanism-first lesson is simple:

  • First, test whether the user’s core goal was met.
  • Then test whether the important criteria were satisfied.
  • Then test whether factual claims are grounded.
  • Then punish confident fabrication more severely than omission.

That sequence is the heart of the paper. It is also a useful template for any business deploying AI in consumer-facing workflows.

ACE does not say frontier models are useless. Quite the opposite: a model scoring 70.1% in Food under a demanding benchmark is not trivial. The systems are capable, increasingly useful, and often impressive. But “impressive” is not the same as “ready to act without verification.”

For consumers, the lesson is: do not confuse formatting with reliability.

For product teams, the lesson is: do not confuse model access with product assurance.

For executives, the lesson is: if your AI assistant affects spending, safety, compatibility, or trust, you need acceptance tests and grounding checks before scale. Otherwise, your product strategy is basically vibes with an API key.

And vibes, unfortunately, do not validate purchase links.

Cognaptus: Automate the Present, Incubate the Future.


  1. Julien Benchek, Rohit Shetty, Benjamin Hunsberger, Ajay Arun, Zach Richards, Brendan Foody, Osvald Nitski, and Bertie Vidgen, “The AI Consumer Index (ACE),” arXiv:2512.04921, 2025. https://arxiv.org/abs/2512.04921