Rollouts

TL;DR for operators AWorld’s useful lesson is not “buy more GPUs”. It is more specific, and therefore more operationally annoying: if an agent learns from interaction, the bottleneck becomes the rate at which it can safely attempt tasks, collect trajectories, score outcomes, and feed those traces back into training. The paper shows three things that matter for builders. First, more rollouts per task sharply raise success rates on GAIA validation: Claude 3.7 Sonnet rises from 47.9% pass@1 to a 76.4% peak, while GPT-4o rises from 27.3% to 65.5% as rollout count increases to 32. Second, AWorld’s distributed executor cuts rollout time for one training cycle from 7,695 seconds to 525 seconds, while training time stays fixed at 144 seconds. That is the paper’s 14.6× speedup, and it is the result that makes the training loop economically less ridiculous. Third, using that loop, Qwen3-32B-AWorld reaches 32.23% GAIA test pass@1, up from 21.59% for the base Qwen3-32B model, and improves xbench-DeepSearch from 12% to 32% without direct training on that benchmark. ...