When LLMs Get a Laptop: Why Sandboxes Might Be the Real AGI Benchmark

Laptop.

That is the deceptively simple object hiding inside this paper. Not a magic planner. Not a thousand-tool agent marketplace. Not a baroque workflow with seventeen orchestration layers and a dashboard that looks like a cockpit designed by consultants.

A laptop.

Or, more precisely, a minimal virtual computer: a sandbox with terminal access, file editing, code execution, persistent files, and the ability to install or fetch resources. In “Computer Environments Elicit General Agentic Intelligence in LLMs,” Cheng et al. ask a question that looks almost too obvious to be interesting until one remembers how much of the AI industry is still trying to squeeze “agency” out of longer prompts.¹

What happens when an LLM is not merely asked to answer, but is placed inside a basic computer environment where it can inspect files, run scripts, search for tools, verify outputs, and leave behind actual artifacts?

The paper’s answer is not that sandboxes make every model brilliant. They do not. Weak models can wander around the terminal like interns who just discovered ls. The more useful finding is narrower and more important: for sufficiently capable models, a minimal computer environment changes the problem-solving mechanism. It lets the model externalize work into files, convert vague reasoning into executable checks, and acquire task-specific tools on demand.

That is why this paper is more interesting than another benchmark table. The benchmark gains matter. But the mechanism matters more.

The sandbox is not a tool list; it is a new work surface

The common misconception is easy to predict: a sandbox sounds like a coding accessory. Give the model Bash, Python, and a file editor, and of course it will do better on programming tasks. Very dramatic. Someone alert the procurement team.

But this paper deliberately focuses on non-code domains first: mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. The sandbox is not evaluated merely as a better coding interface. It is treated as a general environment for intellectual work.

The design is intentionally minimal. LLM-in-Sandbox gives the model only three core tools:

Sandbox component	What it allows	Why it matters
`bash`	Run commands, install packages, execute scripts, access network resources	Turns reasoning into action and feedback
`file_editor`	View, create, and modify files	Makes information persistent and inspectable
`finish`	End the task and submit output	Separates exploration from final response

The point is not that these are fancy tools. They are almost embarrassingly basic. That is the point.

If a highly optimized agent system improves performance, we can always wonder whether the gain came from the base model, the prompt, a hidden retrieval pipeline, a benchmark-specific environment, or a carefully engineered workflow. LLM-in-Sandbox strips the setup down to the computer equivalent of “here is a terminal, a working directory, and permission to try things.”
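
To make the three-tool surface concrete, here is a minimal sketch in Python. It is not the authors' implementation: the class name `MiniSandbox`, the `dispatch` helper, and the tool-call dictionary shape are all hypothetical, and a temporary directory stands in for a properly isolated container.

```python
import subprocess
import tempfile
from pathlib import Path


class MiniSandbox:
    """Hypothetical three-tool surface: bash, file_editor, finish."""

    def __init__(self, workdir: str):
        self.workdir = Path(workdir)
        self.final_answer = None  # set once `finish` is called

    def bash(self, command: str, timeout: int = 30) -> str:
        # Run a shell command in the working directory and return its output.
        # Assumption: no real isolation here; a production sandbox needs a container.
        result = subprocess.run(
            command, shell=True, cwd=self.workdir,
            capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout + result.stderr

    def file_editor(self, action: str, path: str, content: str = "") -> str:
        # Only "create" and "view" are sketched; paths are not hardened.
        target = self.workdir / path
        if action == "create":
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content, encoding="utf-8")
            return f"created {path}"
        if action == "view":
            return target.read_text(encoding="utf-8")
        raise ValueError(f"unsupported action: {action}")

    def finish(self, answer: str) -> str:
        # Separates exploration from the submitted answer.
        self.final_answer = answer
        return answer


def dispatch(sandbox: MiniSandbox, call: dict) -> str:
    """Route one tool call of the form {"tool": name, "args": {...}}."""
    tools = {
        "bash": sandbox.bash,
        "file_editor": sandbox.file_editor,
        "finish": sandbox.finish,
    }
    return tools[call["tool"]](**call["args"])


# Usage: write a file, read it back, then submit.
with tempfile.TemporaryDirectory() as workdir:
    sandbox = MiniSandbox(workdir)
    dispatch(sandbox, {"tool": "file_editor",
                       "args": {"action": "create", "path": "notes.txt", "content": "42"}})
    seen = dispatch(sandbox, {"tool": "file_editor",
                              "args": {"action": "view", "path": "notes.txt"}})
    dispatch(sandbox, {"tool": "finish", "args": {"answer": seen}})
```

Everything interesting in the paper happens above this layer. The surface itself is small enough to read in one sitting, which is rather the point.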

From that minimal setup, the authors identify three transferable meta-capabilities:

External resource access: the model can fetch resources, install libraries, and obtain domain-specific tools.
File management: the model can read, search, organize, and persist information outside the context window.
Code execution: the model can run computations, simulations, constraint checks, and verification scripts.

These are not just conveniences. They change the unit of cognition. A text-only model must keep the task, evidence, intermediate reasoning, and final answer inside the same fragile stream. A sandboxed model can distribute cognition across a filesystem, executable code, logs, and generated outputs.

That is closer to how human knowledge workers use computers. We do not solve a financial model by staring at the keyboard and hallucinating Excel.

The strongest evidence is behavioral, not just numerical

The headline result is straightforward: without additional training, strong models often improve when placed in LLM-in-Sandbox mode. The paper reports gains of up to +15.5 percentage points in mathematics and +14.4 points in instruction following. Across the tested domains, the gains are uneven but broad enough to be meaningful.

A few examples from the main evaluation:

Model / domain	Vanilla LLM	LLM-in-Sandbox	Change
Qwen3-Coder-30B, Mathematics	26.0	41.5	+15.5
GPT-5, Mathematics	87.8	97.9	+10.1
DeepSeek-V3.2-Thinking, Instruction Following	60.3	74.7	+14.4
MiniMax-M2, Chemistry	54.0	68.4	+14.4
Claude-Sonnet-4.5-Thinking, Instruction Following	59.3	72.0	+12.7

These numbers should not be read as “sandboxes always help.” The same table contains failures. GPT-5 drops on the biomedicine benchmark from 55.8 to 49.0. MiniMax-M2 drops on instruction following from 73.0 to 61.3. Qwen3-4B-Instruct performs worse in several sandbox settings, including mathematics, chemistry, and instruction following.

That variation is the interesting part. The environment is not a performance potion. It is a work surface. A model must know how to use it.

The authors therefore do something useful: they inspect what the models actually do inside the sandbox. The case studies make the mechanism concrete.

In a chemistry task, the model is asked to predict molecular properties from compound names. Instead of relying only on internal memory, it installs chemistry-related tools and uses OPSIN to convert chemical names into molecular structures. That is external resource access behaving like on-demand domain augmentation.

In a long-context task, the model is given lengthy industry reports exceeding 100K tokens. It does not simply swallow the whole document into a heroic context window and hope the relevant line survives attention decay. It lists files, uses grep to search for terms, uses sed to jump to relevant line ranges, and writes extraction scripts. That is file management turning long-context reasoning into document operations.
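
The grep-then-inspect pattern can be approximated in a few lines of Python. This is a sketch of the idea, not a transcript of what any model ran: the function name `search_reports`, the `.txt` filter, and the sample report are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def search_reports(root, keyword, context=1):
    """Return (file name, line number, surrounding lines) for each keyword hit."""
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        for i, line in enumerate(lines):
            if keyword.lower() in line.lower():
                # Keep a small neighborhood around the hit, like grep -C.
                start = max(0, i - context)
                end = min(len(lines), i + context + 1)
                hits.append((path.name, i + 1, lines[start:end]))
    return hits


# Demo on a throwaway directory holding one made-up report.
with tempfile.TemporaryDirectory() as root:
    report = Path(root) / "industry_report.txt"
    report.write_text(
        "Overview\nRevenue grew in 2023.\nCosts were flat.\n", encoding="utf-8"
    )
    hits = search_reports(root, "revenue")

for name, line_no, window in hits:
    print(f"{name}:{line_no}", window)
```

Only the matching neighborhoods need to enter the model's context; the rest of the report stays on disk until someone asks for it.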

In an instruction-following task, the model must generate three sentences about medieval history with identical character counts and no overlapping words. Text-only generation is poorly suited to this kind of constraint. The sandboxed model writes scripts to count characters, detect overlaps, and search over candidate sentences. That is code execution turning “please obey this annoying constraint” into a verifiable procedure.
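
A checker for this kind of constraint is short enough to show. The sketch below is an assumption about what such a script might look like, not the model's actual code; the function names and the example sentences are invented for illustration.

```python
import re


def words(sentence):
    # Lowercased alphabetic tokens; punctuation is ignored for overlap checks.
    return set(re.findall(r"[a-z']+", sentence.lower()))


def check_sentences(sentences):
    """Check equal character counts and pairwise-disjoint vocabularies."""
    lengths = [len(s) for s in sentences]
    overlaps = []
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            shared = words(sentences[i]) & words(sentences[j])
            if shared:
                overlaps.append((i, j, sorted(shared)))
    same_length = len(set(lengths)) == 1
    return {
        "lengths": lengths,
        "same_length": same_length,
        "overlaps": overlaps,
        "ok": same_length and not overlaps,
    }


candidates = [
    "Knights guarded castles.",
    "Monks copied holy texts.",
    "Merchants traded spices.",
]
report = check_sentences(candidates)
print(report)
```

A model can generate candidates freely and let the checker reject the ones that fail, instead of trying to count characters in its head.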

The paper’s quantitative behavior analysis supports this interpretation. For strong models, mathematics shows high computation usage; chemistry shows high external-resource usage; long-context tasks show high file-operation usage. The tool behavior changes with the task. That matters because generic tool use would be less impressive. A model typing random shell commands is not an agent. It is just making the terminal suffer.

File systems are a serious answer to long context

One of the cleanest results in the paper concerns long-context tasks. The authors compare two settings inside LLM-in-Sandbox mode: placing documents directly in the prompt versus storing them as files in the environment.

The average score rises from 35.6 with prompt-based context to 49.6 with environment-based context. Some models show very large jumps: Claude rises from 11.9 to 61.8, DeepSeek from 16.8 to 63.8, and Kimi from 51.0 to 61.8. GPT-5 is basically flat, while Qwen-Coder and Qwen-4B perform worse with environment-based context.

This result deserves more attention than the usual “long context is expensive” complaint. The paper is not merely saying that files save tokens. It is saying that documents may be better represented as documents.

A 100K-token report placed inside a prompt is technically accessible, but operationally awkward. The model has to locate, retain, and reason over relevant pieces inside a giant sequence. In a filesystem, the same report can be searched, sliced, indexed, summarized, and revisited. The model can perform ordinary document work: list files, find keywords, inspect neighborhoods, write extractors, and preserve intermediate outputs.

For enterprise AI, this is not a small distinction. Most business knowledge does not arrive as a neat prompt. It arrives as messy folders: PDFs, spreadsheets, contracts, logs, meeting notes, screenshots, CSV exports, and half-finished decks named final_v7_really_final.pptx.

A sandboxed agent does not have to pretend that all of this should become one gigantic input string. It can treat the information environment as an environment.

The boundary is equally important: some models failed to benefit from file-based context. So “put documents in files” is not automatically better. It works when the model can navigate, search, and extract purposefully. Otherwise, the filesystem becomes another maze.

Efficiency comes from moving work out of the model’s mouth

The efficiency section is easy to misunderstand. Multi-turn agentic workflows often look expensive because they produce long trajectories. The model thinks, calls tools, observes output, writes scripts, retries, and eventually submits an answer. More steps usually means more tokens.

The paper finds a more nuanced pattern. For most non-long-context tasks, LLM-in-Sandbox can indeed consume more tokens because of exploration. But for long-context tasks, the token savings are large because the environment stores the bulk information as files rather than pushing it into the prompt. In the reported token table, long-context token consumption drops sharply: for Qwen, from about 102.9K tokens to 12.9K tokens per query; for MiniMax, from 88.4K to 13.6K; for Kimi, from 91.8K to 21.7K.
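
The size of those long-context savings is easier to see as ratios. The snippet below only redoes arithmetic on the per-query figures quoted above; the dictionary layout and the helper name `token_ratio` are illustrative.

```python
# Long-context tokens per query, in thousands, as quoted above: (vanilla, sandbox).
long_context_tokens = {
    "Qwen": (102.9, 12.9),
    "MiniMax": (88.4, 13.6),
    "Kimi": (91.8, 21.7),
}


def token_ratio(vanilla_k, sandbox_k):
    """Fraction of vanilla-mode tokens that sandbox mode consumes."""
    return sandbox_k / vanilla_k


for model, (vanilla_k, sandbox_k) in long_context_tokens.items():
    ratio = token_ratio(vanilla_k, sandbox_k)
    print(f"{model}: {ratio:.0%} of vanilla tokens, about {1 / ratio:.1f}x fewer")
```

On these three models, sandbox mode uses roughly one-eighth to just under one-quarter of the vanilla long-context tokens.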

Aggregated across tasks, LLM-in-Sandbox consumes only 0.49× to 0.84× the total tokens used by vanilla LLM mode in the tested local-serving setup.

That is not because the sandbox removes work. It relocates work.

Some information is stored in files. Some computation is done by scripts. Some feedback arrives as environment output, which the model reads as input instead of producing through slow autoregressive generation. The authors report that environment tokens make up 37% to 51% of trajectories, while environment execution accounts for less than 4% of total time. Throughput is mixed but broadly competitive: MiniMax shows a 2.2× speedup, while the other tested local models range from 0.6× to 1.1×.

The practical lesson is not “agents are always cheaper.” Please do not put that on a slide. The lesson is more operational: sandboxed inference can be cheaper when it replaces repeated language generation with file operations, scripts, and compact final answers. It can be more expensive when the model explores poorly or when the task does not benefit from external computation.

This is why cost modeling for enterprise agents cannot stop at token count per response. It must ask where the work is done.

Weak models need to learn how not to wander

The paper’s reinforcement learning section answers the most uncomfortable question: if weaker models fail to use the environment well, can they learn?

The authors propose LLM-in-Sandbox-RL. The method trains models inside the sandbox using general context-based tasks, not benchmark-specific agent tasks. Context materials are stored as files. The model must explore the environment to answer. The reward is outcome-based: the trajectory receives reward based on final correctness, using rule-based reward functions suited to task type.
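
An outcome-based, rule-based reward can be sketched as follows. The paper's actual reward functions are not reproduced here: the task types `free_text` and `numeric`, the normalization rules, and every function name below are assumptions chosen to illustrate the shape of the idea.

```python
import math
import re


def normalize(text):
    # Lowercase, trim, and collapse whitespace so formatting noise is not penalized.
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_match_reward(prediction, reference):
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


def numeric_reward(prediction, reference, rel_tol=1e-6):
    # Unparseable answers score zero instead of raising.
    try:
        close = math.isclose(float(prediction), float(reference), rel_tol=rel_tol)
    except ValueError:
        return 0.0
    return 1.0 if close else 0.0


# Hypothetical mapping from task type to scoring rule.
REWARD_RULES = {"free_text": exact_match_reward, "numeric": numeric_reward}


def trajectory_reward(task_type, final_answer, reference):
    # Only the submitted answer is scored; intermediate tool calls earn nothing.
    return REWARD_RULES[task_type](final_answer, reference)


print(trajectory_reward("numeric", "3.50", "3.5"))
print(trajectory_reward("free_text", "  Paris ", "paris"))
```

Because nothing in the reward mentions tools, any habit of inspecting files or running checks has to earn its keep through final correctness.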

This setup matters because it avoids a lazy explanation. If the model improves only because it trained on software engineering tasks, then perhaps it merely became a better coding agent. If it improves only because context is still in the prompt, then perhaps it did not learn environment navigation. The paper’s training design pushes the model to learn a more general habit: inspect the environment, find relevant files, execute useful operations, and submit the answer.

The main RL results are strongest for the weaker Qwen3-4B-Instruct model. Before training, this model often performs worse in sandbox mode than in vanilla LLM mode. After LLM-in-Sandbox-RL, its sandbox-mode results improve substantially in several domains: physics rises from 25.6 to 30.3, long-context from 10.5 to 17.6, and instruction following from 29.0 to 38.7. Mathematics improves in both modes, with sandbox mode rising from 32.5 to 53.3.

The behavior analysis explains why. Before training, the weak model uses external resources, file operations, and computation at very low rates, while taking an average of 23.7 turns. After LLM-in-Sandbox-RL, its usage of these capabilities rises and its average turns fall to 7.0. The model becomes less busy and more useful. A rare managerial aspiration, apparently.

For the stronger Qwen3-Coder model, the improvements are more modest. That is expected. It already knows how to use the environment reasonably well. Training still improves some results, including software engineering (SWE) performance from 45.0 to 48.0 and biomedicine in sandbox mode from 18.2 to 36.1, but the story is no longer rescue; it is refinement.

The ablations say the environment is part of the training signal

The paper’s data-source and context-placement experiments are best read as ablations. Their purpose is not to introduce a second thesis, but to test whether the training effect depends on the kind of data and the way context is presented.

For Qwen3-4B-Instruct, the authors compare variants trained on math data, software engineering data, general context data placed in the prompt, and general context data placed in the sandbox. All variants show some cross-domain transfer. But the general-context-in-sandbox setting has the strongest overall profile in the table.

The key comparison is between Gen. in Prompt and Gen. in Sandbox. Both use general context data. The difference is whether the model receives the context directly in the prompt or must retrieve it from files in the environment. The sandbox version performs better overall, which supports the claim that environmental interaction itself contributes to generalization.

This is an important distinction for business readers. The claim is not simply “RL improves models.” That is now a crowded sentence. The more specific claim is: training models to solve general tasks while navigating files and using a computer may teach transferable operational habits.

The authors also report that sandbox-mode training improves vanilla LLM-mode behavior. After LLM-in-Sandbox-RL, models show more structural organization and verification language even when they no longer have environment access. This suggests that interaction with tools may shape internal reasoning patterns, not merely external actions.

That finding should be treated carefully. Pattern counts for verification phrases and structure are indirect signals, not proof of deep cognitive transformation. Still, they are consistent with a useful idea: models trained with executable feedback may internalize habits of decomposition and checking.

The business implication is not “buy an agent”; it is “design the workbench”

The easiest business interpretation would be the least useful one: sandboxes make agents better, therefore every company should deploy sandbox agents. That is the kind of conclusion that looks confident only because it skipped the hard part.

The paper suggests a more precise enterprise framework.

Paper finding	Directly shown	Cognaptus business inference	Boundary
Strong models improve across multiple non-code domains	Benchmark gains across math, physics, chemistry, biomedicine, long-context, and instruction following	High-capability models should be evaluated as computer-using workers, not just chat responders	Gains vary by model and domain
File-based context improves long-context performance on average	Environment-stored documents outperform prompt-stored documents on average	Document-heavy workflows may benefit from filesystem-native agents	Some models perform worse without training
Code execution improves constraint checking and computation	Case studies and domain behavior analysis show computation-oriented tool use	Business processes with verifiable outputs are strong candidates	Not all tasks need code execution
Sandbox-RL helps weaker models use the environment	Qwen3-4B improves after environment-based RL	Smaller models may become useful if trained for operational habits	Training setup, rewards, and task data matter
Infrastructure overhead is low in the tested setup	Shared image is 1.1 GB; containers use 50 MB idle and 200 MB peak	Sandbox infrastructure is not necessarily the bottleneck	Security and governance are not solved by low memory use

For business use, the immediate opportunity is not generic “AGI.” It is a more grounded class of workflows:

analyzing document folders rather than isolated prompts;
checking spreadsheet-like calculations with scripts;
generating files that users can actually open;
converting instructions into verified outputs;
creating audit trails from tool calls and intermediate artifacts;
using local scripts to reduce repeated language-model work.

This is particularly relevant for business-process automation, compliance review, investment research, financial reporting support, procurement analysis, and technical content production. These are not tasks where a model should merely sound plausible. They require files, checks, repeatability, and evidence.

The sandbox is therefore not just an execution environment. It is a governance surface. It gives the system places to log actions, restrict permissions, pin dependencies, inspect artifacts, replay failures, and enforce output formats. In enterprise deployment, those details are not engineering decoration. They are the product.

The real AGI benchmark may be whether the model can use a boring computer

The paper argues that computer environments could become an agentic capability benchmark. That proposal is sensible because a sandbox tests something that ordinary text benchmarks often hide: whether the model can translate intent into operations.

Can it inspect the available files before answering? Can it search rather than guess? Can it write a small program to verify a constraint? Can it install a relevant library without derailing the task? Can it separate scratch work from final output? Can it stop when done?

These are mundane abilities. That is precisely why they matter.

A model that can solve a benchmark in one response may still fail as a worker if it cannot manage files, verify outputs, or recover from tool errors. Conversely, a model with slightly weaker text-only performance may become operationally valuable if it uses the environment well.

This reframes evaluation. Instead of asking only, “What score does the model get when isolated from tools?”, we should also ask, “How much does the model improve when given a normal workbench?” The gap between text-only performance and sandboxed performance may itself be a measure of agentic readiness.
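
As a rough illustration of that gap, the sketch below recomputes sandbox-minus-vanilla deltas from scores quoted earlier in this article. Treating the raw delta as a readiness signal is this article's suggestion, not a metric defined by the paper, and the helper name `sandbox_delta` is hypothetical.

```python
# (model, domain, vanilla score, sandbox score), copied from figures quoted in this article.
results = [
    ("Qwen3-Coder-30B", "Mathematics", 26.0, 41.5),
    ("GPT-5", "Mathematics", 87.8, 97.9),
    ("GPT-5", "Biomedicine", 55.8, 49.0),
    ("MiniMax-M2", "Chemistry", 54.0, 68.4),
    ("MiniMax-M2", "Instruction Following", 73.0, 61.3),
]


def sandbox_delta(vanilla, sandbox):
    """Percentage-point change when the model is given the workbench."""
    return round(sandbox - vanilla, 1)


# Sort from largest gain to largest loss.
ranked = sorted(results, key=lambda row: sandbox_delta(row[2], row[3]), reverse=True)
for model, domain, vanilla, sandbox in ranked:
    print(f"{model:<18} {domain:<22} {sandbox_delta(vanilla, sandbox):+.1f}")
```

The same model can land on both sides of zero depending on the domain, so any readiness score of this kind should be reported per task type instead of as a single number.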

A good agent is not merely a model that knows things. It is a model that knows when not to keep everything in its head.

Boundaries: sandboxes are powerful, not magical

The paper’s limitations are not decorative; they affect deployment.

First, model capability matters. Strong models often benefit from sandbox access. Weak models may wander, consume turns, and degrade performance. This means sandbox deployment should include model selection and environment-use evaluation, not just API access.

Second, the environment introduces security requirements. The paper uses isolated containers, but enterprise systems need more than isolation in principle. They need network policies, package restrictions, dependency pinning, credential handling, file access boundaries, logging, and approval layers for risky operations.
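
One of those layers, a command allowlist with an audit log, can be sketched in a few lines. This is a deliberately crude illustration and not a security control: the allowlist contents, the blocked operators, and the names `review_command` and `audit_log` are all assumptions.

```python
import shlex
import time

ALLOWED_PROGRAMS = {"ls", "cat", "grep", "sed", "python3"}  # hypothetical allowlist
BLOCKED_OPERATORS = (";", "&", "|", "`", "$(", ">", "<")    # no chaining or redirection
audit_log = []


def review_command(command):
    """Allow a command only if its program is allowlisted; log every decision."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        tokens = []  # unbalanced quotes and similar input are rejected
    allowed = bool(tokens) and tokens[0] in ALLOWED_PROGRAMS
    # Reject shell chaining outright instead of trying to parse it.
    if any(op in command for op in BLOCKED_OPERATORS):
        allowed = False
    audit_log.append({"time": time.time(), "command": command, "allowed": allowed})
    return allowed


print(review_command("grep -n revenue report.txt"))
print(review_command("curl http://example.com/install.sh"))
print(review_command("ls; rm -rf /"))
```

A real deployment would enforce this at the container and network level as well. The useful property of the sketch is that refusals are recorded alongside approvals, which is what makes the sandbox auditable.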

Third, the paper’s case studies beyond text generation are promising but early. The generated maps, posters, videos, and music show that sandboxed models can produce usable files, but the authors explicitly note quality limits. A model can create a poster file; that does not make it a senior designer. Let us remain emotionally stable.

Fourth, benchmark gains do not automatically translate into workflow ROI. A production workflow also depends on latency tolerance, failure recovery, review cost, compliance requirements, and how often the task benefits from computation or file operations. Sandboxes are most compelling where the task naturally involves artifacts, verification, and multi-step document work.

Finally, training transfer is suggestive but not final. LLM-in-Sandbox-RL improves generalization in the tested settings, but broader claims about computer-native training need more evidence across models, organizations, tools, and real production tasks.

From chatbot to digital worker, the missing object is the workbench

The useful message of this paper is not that sandboxes are the final form of AI. The useful message is that model intelligence is partly revealed by the environment in which it acts.

A chatbot is asked to answer. A sandboxed model is asked to work.

That difference sounds small until the task involves a 100K-token document bundle, a chemistry toolchain, a strict formatting constraint, a generated file, or a calculation that should be checked rather than narrated. At that point, the interface becomes part of the intelligence.

For Cognaptus readers, the practical question is simple: when evaluating an enterprise AI system, do not only test whether it can produce a polished paragraph. Test whether it can use a basic computer responsibly.

Give it files. Give it constraints. Give it a terminal. Give it permission to verify. Then watch whether it becomes more capable, or merely more theatrical.

The future of agentic AI may not begin with a grand philosophical definition of intelligence. It may begin with something far less glamorous: a clean sandbox, a working directory, and a model that finally learns to stop pretending that every problem belongs inside a chat box.

Cognaptus: Automate the Present, Incubate the Future.

¹ Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen, Li Dong, Wayne Xin Zhao, Ji-Rong Wen, and Furu Wei, “Computer Environments Elicit General Agentic Intelligence in LLMs,” arXiv:2601.16206v3, 2026.
