Shopping looks easy until someone has to calculate the customs duty.
That is roughly the lesson of EcomBench, a new benchmark designed to evaluate foundation agents on realistic e-commerce tasks.[1] The paper’s most useful finding is not that one model ranks above another. Leaderboards are entertaining, in the same way airport departure boards are entertaining when your flight is already delayed. The useful finding is the shape of failure.
On the overall benchmark, the best reported score is 65%. The second-best is 64%. Then the field drops into the mid-50s and below. More importantly, when tasks move from Level 1 to Level 3, the leading models fall from above 90% accuracy to 46%. For the rest of the evaluated models, Level 3 performance generally stays below 35%.
That collapse is the article. Everything else explains why it happens.
The common business misconception is simple: if a flagship agent can answer a product question, summarize a policy page, and calculate a price, it must be close to ready for e-commerce operations. EcomBench says: not quite. A model may look competent when the task is a clean question with a nearby answer. It becomes much less charming when the answer requires regulatory interpretation, cross-source retrieval, multi-step arithmetic, market context, and a final decision that must be both correct and operationally useful.
In other words, e-commerce is not a shopping interface problem. It is a constraint-combination problem.
The headline result is a difficulty cliff, not a leaderboard
EcomBench evaluates a group of current agents and models, including ChatGPT-5.1, Gemini DeepResearch, Flowith Agent, SuperGrok Expert, GenSpark Agent, DeepSeek-Chat, Manus Agent, Doubao DeepResearch, MiniMax Agent, Coze Space Agent, Quark Agent, and Skywork General. The overall ranking places ChatGPT-5.1 at 65%, Gemini DeepResearch at 64%, Flowith Agent at 56%, SuperGrok Expert at 55%, and GenSpark Agent at 53%. The remaining systems score between 43% and 51%.
Those numbers are already sobering. A 65% score in a business operations setting is not “pretty good.” It is a reminder that demos and deployment are two separate species.
But the more important evidence appears when the benchmark is split by difficulty level. The paper defines three levels. Level 1 covers relatively simple cases that test foundational e-commerce expertise and basic tool use. Level 2 requires decomposition and multi-step reasoning. Level 3 is the hardest group, built around cross-source integration, deep retrieval, and long-horizon reasoning and planning.
The performance pattern is clean:
| Evaluation slice | Likely purpose in the paper | Main result | Business interpretation |
|---|---|---|---|
| Overall model comparison | Main evidence | Top scores are 65% and 64% | Current agents are useful but not yet broadly dependable for practical e-commerce expertise |
| Difficulty-level comparison | Main evidence and construct validation | Top models exceed 90% on Level 1 but fall to 46% on Level 3 | Simple task success does not transfer reliably to complex operational requests |
| Category comparison | Exploratory domain comparison | Different models lead in policy-, finance-, and strategy-related categories | Model selection should become routing and evaluation, not brand loyalty |
| Difficulty examples | Benchmark design evidence | Level 3 tasks require more cross-source reasoning and constraint integration | The benchmark is testing operational reasoning, not trivia retrieval |
| Quarterly update design | Implementation detail and contamination control | The benchmark is intended to refresh as models and markets change | Static benchmark scores may age quickly in e-commerce |
The paper’s Figure 3 is especially useful because it turns a vague complaint (“agents struggle with hard tasks”) into a measurable curve. ChatGPT-5.1 and Gemini DeepResearch perform above 90% on Level 1. Both fall to 46% on Level 3. Flowith Agent, despite reaching 95% on Level 1 and 76.7% on Level 2, drops to 28% on Level 3. SuperGrok Expert moves from 80% on Level 1 to 73.3% on Level 2 to 34% on Level 3.
That is not a gentle degradation. It is a cliff.
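To put a number on the cliff, here is a minimal sketch that computes the Level 1 to Level 3 drop for the two systems whose three level scores are all quoted above. The figures come from this article's reading of the paper's Figure 3; the function name `cliff` and the "share kept" metric are illustrative choices, not something the paper defines.

```python
# Level-wise accuracy (%) quoted above from the paper's Figure 3.
# Only systems with all three levels stated in this article are included.
scores = {
    "Flowith Agent": {"L1": 95.0, "L2": 76.7, "L3": 28.0},
    "SuperGrok Expert": {"L1": 80.0, "L2": 73.3, "L3": 34.0},
}


def cliff(levels):
    """Return (absolute drop in points, share of Level 1 accuracy kept at Level 3)."""
    drop = levels["L1"] - levels["L3"]
    retained = levels["L3"] / levels["L1"]
    return drop, retained


for name, levels in scores.items():
    drop, retained = cliff(levels)
    print(f"{name}: down {drop:.1f} points, keeps {retained:.0%} of Level 1 accuracy")
```

Tracking the drop alongside the headline score is a cheap way to keep an internal evaluation from being flattered by easy questions.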
For business readers, the meaning is direct: testing an agent on easy support questions tells you almost nothing about its ability to handle complex seller operations. The agent that answers “what is the refund policy?” may still fail when asked to combine a regulation, a product specification, a tax rule, a logistics constraint, and a numerical calculation. E-commerce is full of exactly those questions, because merchants rarely operate inside one neat category at a time.
The hard cases look boring, which is precisely why they matter
One reason e-commerce is a good stress test is that its difficulty is not theatrical. The hard tasks do not need dragons, puzzles, or philosophical traps. They look like work.
The paper gives examples across difficulty levels. A Level 1 policy question asks about the maximum no-load power consumption allowed for a 48-watt laptop power adapter sold in the United States in 2025. A Level 1 pricing question asks for cumulative growth over three years given a market CAGR and a faster-growing subcategory. These are not trivial, but they are bounded. Find the rule or apply the formula.
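The "apply the formula" half of that Level 1 pricing task is ordinary compounding. A minimal sketch follows; the 10% and 15% rates are hypothetical placeholders, not the benchmark's actual inputs.

```python
def cumulative_growth(cagr: float, years: int) -> float:
    """Total growth over `years` at a constant compound annual growth rate."""
    return (1.0 + cagr) ** years - 1.0


# Hypothetical inputs for illustration only.
market_cagr = 0.10        # overall market
subcategory_cagr = 0.15   # faster-growing subcategory
years = 3

market = cumulative_growth(market_cagr, years)
subcategory = cumulative_growth(subcategory_cagr, years)
print(f"market: {market:.1%}, subcategory: {subcategory:.1%}")
```

The classic slip is to multiply the annual rate by the number of years: three years at 10% compounds to 33.1%, not 30%.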
By Level 2, the benchmark starts combining rules and calculations. One example asks for the probability that a batch of toys is incorrectly accepted under an AQL sampling standard when 2% exceed a lead limit. Another asks for the total amount a German consumer must pay for a UK product bundle involving electronics, books, a digital course, exchange rates, VAT rules, customs duties, and a configuration fee.
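The toy-sampling question is, at its core, a binomial tail. The sketch below assumes a single sampling plan in which the batch is accepted when the number of non-conforming items in the sample is at or below an acceptance number. The sample size of 50 and acceptance number of 1 are hypothetical; a real task would take them from the applicable AQL table. The 2% defect rate is the only figure taken from the example above.

```python
from math import comb


def acceptance_probability(sample_size: int, accept_number: int, defect_rate: float) -> float:
    """P(at most `accept_number` defective items in the sample), binomial model."""
    total = 0.0
    for k in range(accept_number + 1):
        total += (
            comb(sample_size, k)
            * defect_rate**k
            * (1.0 - defect_rate) ** (sample_size - k)
        )
    return total


# sample_size and accept_number are assumed values, not from the paper.
p_accept = acceptance_probability(sample_size=50, accept_number=1, defect_rate=0.02)
print(f"P(accept) = {p_accept:.3f}")
```

The binomial model treats the batch as large relative to the sample; for small batches a hypergeometric model would be the more careful choice.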
Then Level 3 turns the screw. One task asks for the minimum allowable input power for a 48-watt laptop adapter under DOE Level VI efficiency standards. Another asks about an intelligent doorbell operating in the 5.8 GHz band under the EU Radio Equipment Directive, requiring EIRP calculation, out-of-band emission attenuation under EN 300 328, and a compliance judgment.
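The arithmetic inside these Level 3 tasks is not exotic; the difficulty is chaining it to the right regulatory threshold. A rough sketch of the two generic building blocks is below. All input values are hypothetical, and the regulatory limits are deliberately left out, since they would have to be read from the current text of the relevant standard rather than hard-coded from memory.

```python
def eirp_dbm(tx_power_dbm: float, antenna_gain_dbi: float, cable_loss_db: float = 0.0) -> float:
    """Equivalent isotropically radiated power, using the standard dB link-budget sum."""
    return tx_power_dbm + antenna_gain_dbi - cable_loss_db


def dbm_to_mw(power_dbm: float) -> float:
    """Convert a power level in dBm to milliwatts."""
    return 10 ** (power_dbm / 10)


def input_power_w(output_power_w: float, efficiency: float) -> float:
    """Input power implied by an output power and an efficiency in (0, 1]."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return output_power_w / efficiency


# Hypothetical device figures, for illustration only.
print(eirp_dbm(tx_power_dbm=10.0, antenna_gain_dbi=3.0, cable_loss_db=1.0))
print(dbm_to_mw(20.0))
print(input_power_w(output_power_w=48.0, efficiency=0.8))
```

An agent can get every one of these formulas right and still fail the task by comparing the result against the wrong limit, which is the failure mode the benchmark is built to surface.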
This is the important part: the hard tasks are not hard because they are “creative.” They are hard because they require disciplined retrieval, interpretation, calculation, and final answer formatting. That is exactly the type of boring precision that business automation quietly depends on.
A weak agent does not necessarily fail with a dramatic hallucination. It may retrieve the wrong regulatory threshold, miss one condition, use a rounded intermediate value incorrectly, or answer only two of the three requested components. Very efficient. Very wrong.
EcomBench is built from demand first, then made harder
The paper’s benchmark construction matters because it tries to avoid one of the classic benchmark traps: synthetic questions that are clean, elegant, and only distantly related to what users actually ask.
EcomBench starts from real-world user demands embedded in major e-commerce ecosystems such as Amazon. The authors then transform these raw demands into seed questions with verifiable answers. An LLM is used to filter out demands that lack concrete answers, but the authors emphasize that final question reconstruction and labeling rely mainly on human experts rather than pure LLM synthesis.
This is not a decorative methodological point. It affects what the benchmark measures.
If questions are generated mostly from model imagination, they may reflect what a model thinks e-commerce difficulty looks like. If questions start from real user demand and are refined by domain experts, they are more likely to capture the irritating messiness of actual operations: unclear intent, domain-specific terms, compliance details, tax conditions, fulfillment constraints, and decision contexts.
The authors also use peer validation. Each question is independently labeled by at least three experts, and questions with inconsistent answers are discarded. That choice supports answer reliability, though it does not eliminate all judgment risk. E-commerce rules change, regulations are context-sensitive, and some tasks may have edge cases outside the benchmark’s formulation. Still, the curation process is more serious than “ask an LLM to invent some hard questions and hope the vibes are enterprise-grade.”
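A team building its own internal test set could imitate the peer-validation rule with a few lines. This is a sketch under assumptions: the paper does not say how expert answers are compared, so exact match after trimming and lowercasing is a stand-in, and the question IDs and labels are invented.

```python
from typing import List, Optional


def consensus_answer(labels: List[str], min_experts: int = 3) -> Optional[str]:
    """Return the agreed answer, or None if too few labels or any disagreement."""
    if len(labels) < min_experts:
        return None
    normalized = {label.strip().lower() for label in labels}
    if len(normalized) != 1:
        return None
    return labels[0].strip()


# Hypothetical labels from independent experts.
candidates = {
    "q1": ["120 units", "120 Units", " 120 units "],  # unanimous after normalization
    "q2": ["12.5%", "12.5%", "13%"],                  # disagreement: discard
    "q3": ["yes", "yes"],                             # only two experts: discard
}

kept = {}
for question_id, labels in candidates.items():
    answer = consensus_answer(labels)
    if answer is not None:
        kept[question_id] = answer

print(kept)
```

Discarding on any disagreement is strict, and it shrinks the question pool, but it keeps the ground truth from quietly encoding one expert's reading of an ambiguous rule.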
The more distinctive design choice is the tool hierarchy approach for Level 3 task selection. The authors equip an LLM with specialized e-commerce tools, such as product price retrieval and trend analysis, then use rejection sampling to retain questions that require complex reasoning chains and cannot be solved in only a few action steps.
The logic is subtle but useful. A specialized tool can compress a task that would otherwise require many atomic browsing or search actions. If a question still requires complex reasoning even with better tools, it is probably a genuinely difficult task for ordinary agents. Level 3 is therefore not just “harder because the authors say so.” It is harder because the task demands more planning, retrieval depth, and integration across steps.
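The selection logic can be sketched as a simple filter. This is an interpretation, not the authors' code: `Attempt`, `select_level3`, and the three-step threshold are hypothetical, and the agent runner is replaced by a dictionary of fake traces so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Attempt:
    solved: bool  # did the tool-equipped agent reach the ground-truth answer?
    steps: int    # number of tool or action steps it used


def select_level3(
    questions: List[str],
    run_agent: Callable[[str], Attempt],
    max_easy_steps: int = 3,  # assumed threshold for "only a few action steps"
) -> List[str]:
    """Reject questions solved in a few steps even with specialized tools."""
    kept = []
    for question in questions:
        attempt = run_agent(question)
        if attempt.solved and attempt.steps <= max_easy_steps:
            continue  # too easy once good tools are available
        kept.append(question)
    return kept


# Fake traces standing in for real agent runs.
fake_traces: Dict[str, Attempt] = {
    "lookup a single price": Attempt(solved=True, steps=1),
    "compare trend across three markets": Attempt(solved=True, steps=9),
    "multi-rule compliance check": Attempt(solved=False, steps=12),
}

hard = select_level3(list(fake_traces), fake_traces.__getitem__)
print(hard)
```

The same filter works for an internal benchmark: if your best-equipped agent clears a question in two calls, it belongs in the warm-up set, not the hard set.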
For companies evaluating agents, this suggests a practical design principle: difficulty should be defined by workflow burden, not by how impressive the prompt sounds.
Seven categories make e-commerce a real operational domain
EcomBench covers seven e-commerce task categories:
| Category | What it tests operationally |
|---|---|
| Policy Consulting | Platform rules, qualification submissions, tax registration, compliance workflows |
| Cost and Pricing | Profit checks, quotes, price adjustment, market/customer pricing conditions |
| Fulfillment Execution | Shipping, returns and exchanges, logistics route improvement |
| Marketing Strategy | Promotions, ads, traffic growth, visibility planning |
| Intelligent Product Selection | Trend signals and data insights for product category selection |
| Opportunity Discovery | Early signals for new growth directions |
| Inventory Control | Safety stock, replenishment, clearance, stockout and overstock balance |
This taxonomy is more valuable than it first appears. It makes clear that “e-commerce agent” is not one job. It is several jobs wearing the same badge.
A policy task may require legal-style reading and exact rule interpretation. A pricing task may require numerical consistency and tax logic. A fulfillment task may require operational sequencing. A marketing task may require strategic judgment under incomplete evidence. Product selection and opportunity discovery may involve trend interpretation, where a correct answer may depend on both data and framing. Inventory control lives in the unpleasant zone where math meets business risk.
So when a vendor says an agent is “good at e-commerce,” the correct response is: which part? Policy? Pricing? Logistics? Marketing? Product discovery? Inventory? All of them? On Level 1 or Level 3?
Annoying questions, yes. Also known as procurement.
Category results point toward model routing, not model worship
The paper’s category-level evaluation groups the seven task categories into three larger domains: Policy-Related, Finance-Related, and Strategy-Related.
The results are not uniform. ChatGPT-5.1 leads overall and also leads the Policy-Related group at 64.9%, closely followed by Gemini DeepResearch at 63.2%. In Finance-Related tasks, SuperGrok Expert leads at 70.6%, ahead of ChatGPT-5.1 and Flowith Agent, both at 64.7%. In Strategy-Related tasks, Gemini DeepResearch leads at 69.2%, followed by ChatGPT-5.1 at 65.4%.
That pattern matters because it weakens the simplest enterprise buying story: choose the strongest general model and deploy it everywhere.
The paper does not prove that a routing system would outperform every single-agent setup in production. It does, however, provide a strong reason to test for routing. If different models show different relative strengths across policy, finance, and strategy tasks, a serious e-commerce agent architecture should not assume one model is always best. It should consider task classification, model routing, specialist tools, and confidence-aware escalation.
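As a starting point for that test, the per-domain leaders can be read straight off the scores quoted above. The sketch below uses only the figures this article reports (the paper covers more systems), and picking the top scorer is a naive rule chosen for illustration; benchmark leadership is not proof of production performance.

```python
# Domain-level accuracy (%) as quoted in this article; other systems omitted.
reported = {
    "Policy-Related": {"ChatGPT-5.1": 64.9, "Gemini DeepResearch": 63.2},
    "Finance-Related": {
        "SuperGrok Expert": 70.6,
        "ChatGPT-5.1": 64.7,
        "Flowith Agent": 64.7,
    },
    "Strategy-Related": {"Gemini DeepResearch": 69.2, "ChatGPT-5.1": 65.4},
}

# Naive rule: choose the highest-scoring system in each domain.
leaders = {domain: max(scores, key=scores.get) for domain, scores in reported.items()}

for domain, model in leaders.items():
    print(f"{domain}: {model} ({reported[domain][model]}%)")
```

Three domains, three different leaders. The margins are small in places, so a real routing decision should rest on your own task set rather than on these gaps alone.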
A possible business architecture looks like this:
User request
↓
Task classifier
↓
Domain route: policy / finance / fulfillment / marketing / inventory / product selection / opportunity discovery
↓
Specialized retrieval + tool bundle
↓
Model or agent selected by domain performance
↓
Verifier checks answer format, source grounding, calculation, and policy constraints
↓
Human review only for high-risk or low-confidence cases
This is an inference from the paper, not something the paper directly implements. The paper reports benchmark performance and domain variation. The business architecture follows from those results: if agent strengths vary by category, then deployment should evaluate and exploit that variation rather than pretending all tasks are one task.
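A skeletal version of that flow might look like the following. Everything here is hypothetical scaffolding: the keyword classifier, the model names in `ROUTES`, and the rule that policy requests always go to human review are placeholders for components a real team would build and tune. The model call and the verifier are passed in as functions so the sketch runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical domain-to-model routing table.
ROUTES = {"policy": "policy_model", "finance": "finance_model", "strategy": "strategy_model"}

# Toy keyword classifier; a production system would use something sturdier.
KEYWORDS = {
    "policy": ("regulation", "compliance", "policy", "directive"),
    "finance": ("price", "vat", "duty", "margin", "cost"),
}


def classify(request: str) -> str:
    text = request.lower()
    for domain, words in KEYWORDS.items():
        if any(word in text for word in words):
            return domain
    return "strategy"  # fallback when no keyword matches


@dataclass
class Result:
    domain: str
    model: str
    answer: str
    needs_human: bool


def handle(
    request: str,
    call_model: Callable[[str, str], str],
    verify: Callable[[str, str], bool],
    high_risk: Tuple[str, ...] = ("policy",),
) -> Result:
    domain = classify(request)
    model = ROUTES[domain]
    answer = call_model(model, request)
    # Escalate when verification fails or the domain is marked high-risk.
    needs_human = (not verify(request, answer)) or domain in high_risk
    return Result(domain, model, answer, needs_human)


result = handle(
    "What VAT applies to this bundle?",
    call_model=lambda model, request: "stub answer",
    verify=lambda request, answer: True,
)
print(result)
```

Keeping the verifier and the escalation rule outside the model call is the point of the design: the model can be swapped per domain without touching the checks.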
The phrase “AI agent” is too broad to be useful here. E-commerce needs an agent system with role separation.
The benchmark is also a procurement tool in disguise
The business value of EcomBench is not merely academic. It changes what a buyer should ask before trusting an agent inside an e-commerce workflow.
A weak evaluation process asks:
Can the agent answer e-commerce questions?
A better evaluation process asks:
Can the agent answer our hardest recurring operational questions, with correct calculation, correct policy interpretation, correct source use, and a decision-ready final response?
EcomBench helps push evaluation toward the second question. Its design suggests at least five procurement tests.
First, test by task category. A seller operations team should not accept a single aggregate score. Separate compliance, pricing, fulfillment, marketing, product selection, opportunity discovery, and inventory control. A model that is strong in pricing may be mediocre in strategy.
Second, test by difficulty level. Level 1 success should be treated as a warm-up. Level 3 is where operational reliability starts to become visible. If a vendor shows only simple support tasks, assume the hard part is still hiding backstage, probably charging consulting fees.
Third, test answer verifiability. EcomBench uses uniquely verifiable ground-truth answers. Businesses should imitate this. Before deploying agents into live workflows, build an internal set of questions with known answers, expected reasoning paths, and strict output requirements.
Fourth, test tool dependence. Many e-commerce tasks require current information: regulations, marketplace policies, shipping rules, tax rates, product trends, and market prices. A model’s static knowledge is not enough. The question is whether the agent can retrieve, interpret, and use fresh information correctly.
Fifth, refresh the test set. The authors plan quarterly updates because both models and e-commerce conditions change. This is not just a benchmark maintenance detail. It is a governance principle. In a dynamic domain, a one-time evaluation becomes stale quickly.
That last point is especially important for AI automation teams. You cannot evaluate an e-commerce agent once, write “approved” in a spreadsheet, and then walk away feeling managerial. Policies change. Product categories shift. Models update. Tools break. The benchmark has to move because the business moves.
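The first two tests, by category and by difficulty, amount to refusing to average too early. A minimal sketch of a stratified scorecard is below; the record format and the sample results are invented for illustration.

```python
from collections import defaultdict


def stratified_accuracy(records):
    """Accuracy per (category, level) cell instead of one aggregate number."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [correct, total]
    for record in records:
        cell = counts[(record["category"], record["level"])]
        cell[0] += int(record["correct"])
        cell[1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


# Hypothetical evaluation results.
records = [
    {"category": "Cost and Pricing", "level": 1, "correct": True},
    {"category": "Cost and Pricing", "level": 1, "correct": True},
    {"category": "Cost and Pricing", "level": 3, "correct": False},
    {"category": "Cost and Pricing", "level": 3, "correct": True},
    {"category": "Policy Consulting", "level": 3, "correct": False},
]

table = stratified_accuracy(records)
overall = sum(r["correct"] for r in records) / len(records)

print(f"overall: {overall:.0%}")
for (category, level), accuracy in sorted(table.items()):
    print(f"{category} / Level {level}: {accuracy:.0%}")
```

In this toy data the aggregate looks passable while one Level 3 cell sits at zero, which is exactly the kind of result a single score hides. Rerunning the same scorecard on a refreshed question set each quarter covers the fifth test.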
What the paper directly shows, and what Cognaptus infers
It is useful to separate evidence from business interpretation.
| Layer | What belongs here | What we can responsibly say |
|---|---|---|
| Direct paper result | EcomBench construction and model scores | Current agents perform well on easier tasks but struggle sharply on high-difficulty e-commerce tasks |
| Strong practical implication | Evaluation design | Businesses should test agents on realistic, category-specific, difficulty-stratified workflows |
| Reasonable Cognaptus inference | Deployment architecture | Model routing, specialist tools, verification, and escalation are likely safer than one general agent handling everything |
| Still uncertain | Production outcomes | The benchmark does not prove ROI, transaction safety, or end-to-end operational reliability in live environments |
This separation matters because benchmark results are often abused in two opposite ways. Optimists turn them into sales slides. Skeptics dismiss them as artificial. EcomBench deserves neither treatment.
It is not a production audit. But it is also not a toy benchmark. It captures enough operational structure to expose where current agents break: not at the level of speaking fluently, but at the level of integrating constraints correctly.
That is exactly where many business failures happen.
The limits are real, but they do not weaken the central warning
The paper is explicit about its limitations.
EcomBench currently focuses on question-answering tasks. It does not directly evaluate agents in interactive environments. That means it does not test the full loop of observing a marketplace interface, clicking through workflows, updating records, handling failed API calls, or executing transactions. Those are important missing pieces for production e-commerce automation.
The benchmark also does not yet fully cover predictive tasks, such as product selection and market trend forecasting. The authors plan to incorporate these in future releases. This matters because many high-value e-commerce decisions are forward-looking. “What is the rule?” is different from “which product category should we enter next quarter?” The first can often be verified cleanly. The second depends on uncertain data, assumptions, and business appetite for risk.
There is also a scoring boundary. The paper uses an LLM judge to compare model outputs with ground-truth answers and assigns binary correctness, with manual inspection of a subset of evaluations. This is a reasonable approach for scalable benchmarking, especially when answers may be semantically equivalent but phrased differently. Still, binary scoring can hide partial competence. A model that gets the calculation right but misses the compliance judgment receives the same failure label as a model that gets everything wrong. For deployment, partial failure modes matter.
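The gap between binary and component-level scoring is easy to see in a small sketch. The component names below are hypothetical; the paper's judge assigns a single correct or incorrect label, and the profile function is an addition a deployment team might layer on top for its own diagnostics.

```python
def binary_score(components):
    """1 only if every required component of the answer is correct."""
    return int(all(components.values()))


def component_profile(components):
    """Share of components correct, plus the names of the ones that failed."""
    failed = sorted(name for name, ok in components.items() if not ok)
    share = sum(components.values()) / len(components)
    return {"share_correct": share, "failed": failed}


# Two hypothetical model outputs graded per component.
near_miss = {"retrieval": True, "calculation": True, "compliance_judgment": False}
total_miss = {"retrieval": False, "calculation": False, "compliance_judgment": False}

print(binary_score(near_miss), binary_score(total_miss))
print(component_profile(near_miss))
print(component_profile(total_miss))
```

Both outputs score zero under the binary rule, yet only one of them points to a fixable weakness. The binary label is the right bar for go-live decisions; the profile is what tells an engineering team where to spend the next sprint.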
Finally, the human-curated and dynamically updated nature of the benchmark is both a strength and a cost. Expert validation improves quality, but it also means long-term maintenance requires sustained effort. Living benchmarks are useful because the world moves. They are expensive for the same reason. Reality, annoyingly, does not maintain itself.
None of these limitations cancels the main finding. If anything, they make it more practical. EcomBench tests a narrower setting than full production automation, and agents still struggle badly on the hardest tasks. The full production setting is not easier.
The final boss is not the shopping cart
E-commerce is often treated as a friendly commercial domain: products, prices, promotions, customers, carts. But for agents, it is a dense operational environment. Policies change. Tax rules vary. Fulfillment details matter. Inventory decisions have cash-flow consequences. Product selection depends on trend signals that age quickly. Marketing advice without context is just confident decoration.
EcomBench’s contribution is to make that complexity measurable. It gives researchers and businesses a clearer way to ask whether agents can handle realistic e-commerce expertise, not just generic internet fluency.
The evidence-first lesson is simple. When tasks are easy, flagship agents look strong. When tasks become realistic, the same agents lose reliability fast. That is not a reason to abandon e-commerce agents. It is a reason to design them like serious operational systems: benchmarked by category, routed by capability, grounded in tools, verified before action, and refreshed as the market changes.
The future of e-commerce automation will not be won by the agent that talks the most smoothly. It will be won by the system that knows when a question is simple, when it is dangerous, which tool to use, which model to trust, and when to stop before turning a compliance problem into a customer service incident.
A little less magic. A little more operations. Tedious, perhaps. Also how businesses survive.
Cognaptus: Automate the Present, Incubate the Future.
[1] Rui Min et al., “EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commerce,” arXiv:2512.08868, 2025. https://arxiv.org/abs/2512.08868