Bench to the Future: Why E-commerce Is the Real Final Boss for Foundation Agents
Opening: Why this matters now

Foundation agents have finally escaped the lab. They browse the web, query APIs, plan multi-step workflows, and increasingly intervene in high-stakes business operations. Yet for all the hype, one stubborn truth remains: most benchmarks still measure agent performance in toy universes such as mazes, puzzles, and synthetic tasks. Real businesses, unfortunately, do not operate in puzzles. ...