World-Building for Agents: When Synthetic Environments Become Real Advantage

Opening — Why this matters now Everyone wants “agentic AI.” Few are prepared to train it properly. As large language models evolve into tool-using, multi-step decision makers, the bottleneck is no longer raw model scale. It is environment scale. Real-world reinforcement learning (RL) for agents is expensive, fragile, and rarely reproducible. Public benchmarks contain only a handful of environments. Real APIs throttle you. Human-crafted simulations do not scale. ...

February 11, 2026 · 4 min · Zelina

Confidence Is Not Truth, But It Can Steer: When LLMs Learn When to Stop

Opening — Why this matters now Large Language Models are no longer compute-bound at training time. They are inference-bound at deployment time. The last year has made this painfully clear. Frontier reasoning models increasingly win benchmarks not by being smarter, but by thinking more: longer chains-of-thought, more samples, more retries, more votes. The result is an arms race in test-time scaling—512 samples here, best-of-20 there—where accuracy inches upward while token bills explode. ...

February 10, 2026 · 4 min · Zelina

Drafts, Then Do Better: Teaching LLMs to Outgrow Their Own Reasoning

Opening — Why this matters now Large language models have learned to sound confident. Unfortunately, confidence is not correctness—especially in long-horizon reasoning tasks like competition math or multi-step logic. Reinforcement learning has helped, but most RL pipelines still assume a one-shot world: generate once, score once, update once. Humans don’t work that way. We draft, reread, cringe, fix, and try again. ...

February 10, 2026 · 4 min · Zelina

Agents Need Worlds, Not Prompts: Inside ScaleEnv’s Synthetic Environment Revolution

Opening — Why this matters now The past two years of agent research have been oddly paradoxical. Models have grown more capable, benchmarks more elaborate, yet agent failures remain stubbornly familiar: brittle tool calls, shallow exploration, and a suspicious tendency to memorize solution templates. The culprit, ScaleEnv argues, is not the agent—but the world it is trained in. ...

February 9, 2026 · 3 min · Zelina

From Features to Actions: Why Agentic AI Needs a New Explainability Playbook

Opening — Why this matters now Explainable AI has always promised clarity. For years, that promise was delivered—at least partially—through feature attributions, saliency maps, and tidy bar charts explaining why a model predicted this instead of that. Then AI stopped predicting and started acting. Tool-using agents now book flights, browse the web, recover from errors, and occasionally fail in slow, complicated, deeply inconvenient ways. When that happens, nobody asks which token mattered most. They ask: where did the agent go wrong—and how did it get there? ...

February 9, 2026 · 4 min · Zelina

When Agents Believe Their Own Hype: The Hidden Cost of Agentic Overconfidence

Opening — Why this matters now AI agents are no longer toy demos. They write production code, refactor legacy systems, navigate websites, and increasingly make decisions that matter. Yet one deceptively simple question remains unresolved: can an AI agent reliably tell whether it will succeed? This paper delivers an uncomfortable answer. Across frontier models and evaluation regimes, agents are systematically overconfident about their own success—often dramatically so. As organizations push toward longer-horizon autonomy, this blind spot becomes not just an academic curiosity, but a deployment risk. ...

February 9, 2026 · 4 min · Zelina

When Aligned Models Compete: Nash Equilibria as the New Alignment Layer

Opening — Why this matters now Alignment used to be a single‑model problem. Train the model well, filter the data, tune the reward, and call it a day. That framing quietly breaks the moment large language models stop acting alone. As LLMs increasingly operate as populations—running accounts, agents, bots, and copilots that interact, compete, and imitate—alignment becomes a system‑level phenomenon. Even perfectly aligned individual models can collectively drift into outcomes no one explicitly asked for. ...

February 9, 2026 · 4 min · Zelina

Learning to Inject: When Prompt Injection Becomes an Optimization Problem

Opening — Why this matters now Prompt injection used to be treated as a craft problem: clever wording, social engineering instincts, and a lot of trial and error. That framing is now obsolete. As LLMs graduate from chatbots into agents that read emails, browse documents, and execute tool calls, prompt injection has quietly become one of the most structurally dangerous failure modes in applied AI. ...

February 8, 2026 · 4 min · Zelina

Benchmarks Lie, Rooms Don’t: Why Embodied AI Fails the Moment It Enters Your House

Opening — Why this matters now Embodied AI is having its deployment moment. Robots are promised for homes, agents for physical spaces, and multimodal models are marketed as finally “understanding” the real world. Yet most of these claims rest on benchmarks designed far away from kitchens, hallways, mirrors, and cluttered tables. This paper makes an uncomfortable point: if you evaluate agents inside the environments they will actually operate in, much of that apparent intelligence collapses. ...

February 7, 2026 · 4 min · Zelina

First Proofs, No Training Wheels

Opening — Why this matters now AI models are now fluent in contest math, symbolic manipulation, and polished explanations. That’s the easy part. The harder question—the one that actually matters for science—is whether these systems can do research when the answer is not already in the training set. The paper First Proof arrives as a deliberately uncomfortable experiment: ten genuine research-level mathematics questions, all solved by humans, none previously public, and all temporarily withheld from the internet. ...

February 7, 2026 · 3 min · Zelina