Breaking Things on Purpose: How CLI-Gym Teaches AI to Fix the Real World

Broken environments are where coding agents stop looking magical.

A model can write a neat Python function, patch a repository, and explain the bug with courtroom confidence. Then it enters a terminal, meets a missing shared library, a corrupted dependency, a bad environment variable, or a filesystem permission issue, and suddenly the “autonomous engineer” starts behaving like an intern trapped inside conda. Not a bad intern, perhaps. Just one who keeps running the same command and hoping Linux will become more emotionally cooperative.

The paper “CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion” addresses this less glamorous side of agentic coding.¹ Its point is simple but important: real software work is not only code editing. It is also environment repair. And environment repair is hard to train because the data is scarce.

The authors’ answer is elegant in the slightly mischievous way good engineering often is: start with a healthy Dockerized repository, ask an agent to break it, capture the broken state, and then turn that state into a repair task.

In other words, if we do not have enough real histories of broken environments, manufacture them — systematically, reproducibly, and with tests attached.

That mechanism matters more than the headline benchmark number. The benchmark number is useful. The mechanism is the article.

The missing training data is not code history; it is environment history

The modern coding-agent ecosystem has been heavily shaped by code-centric benchmarks. This makes sense. GitHub repositories are unusually cooperative data objects. They contain commits, pull requests, issues, tests, and code diffs. If a researcher wants to build a training task from a real bug, the workflow is almost natural:

1. find a pull request that fixed a problem;
2. revert the change;
3. package the failing state;
4. use tests to verify whether the agent can restore correctness.

This is why code-intensive benchmarks can scale. The history is already there.

Environment-intensive tasks are different. A developer’s local runtime environment does not usually leave behind a clean, annotated biography of every mistake. There is no universal commit log for “accidentally downgraded NumPy,” “deleted a needed dynamic library,” or “poisoned PATH while trying to fix something else.” Dockerfiles help package environments, but they usually describe the final intended setup, not a rich sequence of failures and repairs.

That asymmetry is the paper’s starting point.

The authors contrast code-intensive task pipelines, such as SWE-Bench-style and SWE-Gym-style datasets, with CLI-oriented environment tasks. Terminal-Bench contains human-written terminal tasks; CLI-Gym produces 1,655 environment-intensive tasks from 29 Python repositories. The important difference is not just size. It is the data engine. Code tasks can be mined from software history. Environment tasks, at least in this paper, are generated by simulating histories that did not exist.

That is the clever inversion.

CLI-Gym turns task generation into an agent task

The core idea can be explained as a state transition.

A healthy software environment contains at least three components:

$$ S = (B, D, C) $$

where $B$ is the base environment, $D$ is the Dockerfile or environment specification, and $C$ is the codebase.

A normal repair task asks an agent to move from a poor state to a gold state:

$$ S_{poor} \rightarrow S_{gold} $$

CLI-Gym reverses the direction during data generation. It starts from a working repository environment where unit tests pass, then asks an agent to deliberately degrade it until some tests fail:

$$ (S_{gold}, T_{passed}) \rightarrow (S_{poor}, T_{failed}) $$

Once the degraded state exists, the pipeline can package it as a normal repair task. The training or evaluation agent will later see the broken environment and issue description, then try to repair it.

This is not the same as asking an LLM to hallucinate a Dockerfile. The important feature is feedback. The degradation agent acts inside an executable environment, runs commands, receives test outcomes, and produces a Dockerfile-style record of meaningful changes. The pipeline then checks whether the induced failures are real.
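A minimal Python sketch of that feedback loop, under loud assumptions: the environment is modeled as a plain dictionary rather than a real container, `run_tests` is a hypothetical stand-in for executing the selected unit tests, and the candidate degradations are hand-written here rather than proposed by an agent. None of these names come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Env = Dict[str, str]                          # toy stand-in for a container's state
Degradation = Tuple[str, Callable[[Env], Env]]


@dataclass
class RepairTask:
    poor_env: Env            # the captured broken state
    changes: List[str]       # ordered record of the changes that were applied
    failed_tests: List[str]  # evidence that the breakage is real


def run_tests(env: Env) -> Dict[str, bool]:
    """Hypothetical test runner: maps each selected test to pass/fail."""
    return {
        "test_path_has_usr_bin": "/usr/bin" in env["PATH"].split(":"),
        "test_sqlite_loads": env["libsqlite3"] == "ok",
    }


def invert(gold: Env, candidates: List[Degradation]) -> Optional[RepairTask]:
    """Degrade a passing environment until at least one test fails."""
    if not all(run_tests(gold).values()):
        raise ValueError("gold environment must pass every selected test")

    env, changes = dict(gold), []
    for description, apply in candidates:
        env = apply(env)
        changes.append(description)
        # Feedback step: re-run the tests after every change.
        failed = [name for name, ok in run_tests(env).items() if not ok]
        if failed:
            return RepairTask(env, changes, failed)
    return None  # nothing broke, so there is no task to package


gold = {"PATH": "/usr/bin", "libsqlite3": "ok"}
candidates = [
    ("append /tmp to PATH", lambda e: {**e, "PATH": e["PATH"] + ":/tmp"}),  # harmless
    ("corrupt libsqlite3", lambda e: {**e, "libsqlite3": "corrupt"}),
]
task = invert(gold, candidates)
print(task.failed_tests)
print(task.changes)
```

This toy records every applied change, including the harmless one. A real pipeline would presumably keep only the changes that matter for reproducing the failure.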

The paper calls this agentic environment inversion. The phrase is a little academic, but the operational meaning is clear: use agents not only to solve tasks, but also to create the broken worlds in which future agents must learn to operate.

The pipeline is chaos engineering with unit tests attached

CLI-Gym has three broad stages.

| Stage | Input | Mechanism | Output |
| --- | --- | --- | --- |
| Gold environment construction | A GitHub repository | Build a Dockerized environment and verify that selected unit tests pass | A runnable oracle state |
| Environment inversion | The gold state and selected tests | An agent deliberately modifies the environment until tests fail | A reproducible broken state |
| Repair task generation | Failed tests and error feedback | An LLM synthesizes an issue description, optionally with hints | A CLI repair task instance |

The degradation step is the heart of the method.

The agent is not limited to changing application code. It can manipulate filesystems, dependency versions, virtual environments, system paths, libraries, and configuration states. The paper gives a memorable example involving corrupted ELF headers in shared libraries such as libsqlite3 and libz. That kind of failure does not politely announce itself as a missing semicolon. It forces the repair agent to diagnose dynamic linking and runtime behavior.
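To make the ELF example concrete, here is a small sketch of one check a repair agent might run first: does the file still begin with the ELF magic bytes `\x7fELF`? The sketch fabricates a stand-in file in a temporary directory instead of touching a real library. The file name `libexample.so` and the helper `has_elf_magic` are hypothetical, and the paper does not describe this particular check.

```python
import os
import tempfile

ELF_MAGIC = b"\x7fELF"  # first four bytes of every valid ELF file


def has_elf_magic(path: str) -> bool:
    """Return True if the file starts with the ELF magic number."""
    with open(path, "rb") as f:
        return f.read(4) == ELF_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    lib = os.path.join(tmp, "libexample.so")  # hypothetical stand-in, not a real library

    # Write a fake "healthy" file: magic bytes followed by padding.
    with open(lib, "wb") as f:
        f.write(ELF_MAGIC + b"\x00" * 60)
    healthy = has_elf_magic(lib)

    # Simulate the degradation by overwriting the first four bytes in place.
    with open(lib, "r+b") as f:
        f.write(b"\x00\x00\x00\x00")
    corrupted = has_elf_magic(lib)

print(healthy, corrupted)
```

A magic-byte check only catches the crudest damage. Corruption deeper in the header would need a fuller inspection, for example with a tool such as `readelf`.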

This distinction is central. Code repair often asks: “What line should change?” Environment repair asks: “What world am I inside, and why is it failing?”

For businesses experimenting with coding agents, that second question is where many production failures live. Real deployment pipelines are not textbook repositories. They are half code, half rituals: shell scripts, package managers, Docker layers, CI settings, credentials, permissions, caches, and fragile conventions that nobody wants to document because documentation would imply someone understands them.

CLI-Gym is valuable because it trains against that mess instead of pretending the mess is peripheral.

The dataset result is large, but the filtering result is more revealing

The paper reports the following construction pipeline:

| Quantity | Reported value | Interpretation |
| --- | --- | --- |
| Gold instances | 29 repositories | The seed environments used for generation |
| Generated task prompts | 4,066 | Candidate degradation attempts |
| Final problem instances | 1,655 | Broken environments with failing tests and task packaging |
| Successful repair trajectories | 417 | Trajectories where strong agents solved generated tasks |
| Filtered high-quality trajectories | 291 | Training data after removing trivial or shortcut-based solutions |
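The narrowing between rows can be worked out directly from the reported counts. The short calculation below simply divides each row by the one before it. Treat the second ratio with care: the table does not say how many rollouts were attempted per task, so it is a rough indicator rather than a solve rate.

```python
# Counts as reported in the table above.
funnel = [
    ("generated task prompts", 4066),
    ("final problem instances", 1655),
    ("successful repair trajectories", 417),
    ("filtered high-quality trajectories", 291),
]

rates = {}
for (prev_name, prev_count), (name, count) in zip(funnel, funnel[1:]):
    rates[name] = round(100 * count / prev_count, 1)  # percent of the previous row
    print(f"{prev_name} -> {name}: {rates[name]}%")
```

Roughly two in five candidate prompts become usable tasks, and about 70% of successful trajectories survive filtering.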

The filtering stage deserves attention.

The authors remove trajectories with fewer than 20 steps, treating very short solutions as likely trivial. They also remove “cheating” trajectories that exploit artifacts such as cached Git information, Conda logs, or unintended shortcuts. This is not cosmetic cleaning. In agent training, shortcuts are poison with a friendly smile. If the agent learns to solve the benchmark by reading forensic leftovers rather than diagnosing the environment, the score improves and the product gets worse. Wonderful for a leaderboard; less wonderful when deployed into a build pipeline at 2 a.m.

The paper’s filtering rule reflects an important principle for enterprise agent evaluation: successful completion is not enough. You need to know how the agent succeeded. A repair that works by exploiting hidden artifacts is not operational competence. It is benchmark archaeology.
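A filter in this spirit might look like the sketch below. The 20-step threshold comes from the text. Everything else is an assumption made for illustration: the `Trajectory` shape, counting one step per command, and the two substring markers standing in for "cached Git information" and "Conda logs". The paper's actual shortcut detection is not specified here.

```python
from dataclasses import dataclass
from typing import List

MIN_STEPS = 20  # threshold reported in the text
# Hypothetical markers; a real detector would need a more careful design.
SHORTCUT_MARKERS = (".git/", "conda-meta/history")


@dataclass
class Trajectory:
    task_id: str
    commands: List[str]  # assumption: one agent step per command
    solved: bool


def is_high_quality(t: Trajectory) -> bool:
    """Keep solved trajectories that are long enough and touch no shortcut artifact."""
    if not t.solved or len(t.commands) < MIN_STEPS:
        return False
    return not any(marker in cmd for cmd in t.commands for marker in SHORTCUT_MARKERS)


padding = [f"echo step {i}" for i in range(25)]
trajectories = [
    Trajectory("short", ["pip install -r requirements.txt"], True),
    Trajectory("cheat", padding + ["cat /repo/.git/ORIG_HEAD"], True),
    Trajectory("honest", padding + ["python -m pytest"], True),
    Trajectory("failed", padding, False),
]
kept = [t.task_id for t in trajectories if is_high_quality(t)]
print(kept)
```

Substring matching is deliberately crude. It shows that the filter inspects how a trajectory succeeded, not only whether it did.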

The main evidence: small targeted trajectories beat raw scale

The authors fine-tune Qwen3-based models and evaluate them on Terminal-Bench using OpenHands as the agent framework. The strongest reported model, LiberCoder-235B-A22B, reaches 46.1 pass@1 on Terminal-Bench 1.0 and 31.0 on Terminal-Bench 2.0 under the OpenHands setup. The 32B version also improves sharply.

| Model | Base pass@1 on Terminal-Bench 1.0 | After CLI-Gym training | Absolute gain |
| --- | --- | --- | --- |
| Qwen3-32B → LiberCoder-32B | 10.3 | 38.9 | +28.6 |
| Qwen3-235B-A22B → LiberCoder-235B-A22B | 25.0 | 46.1 | +21.1 |

On Terminal-Bench 2.0, the paper reports improvements from 5.7 to 19.5 for the 32B model and from 18.1 to 31.0 for the 235B-A22B model.
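For readers who want the gains on both benchmark versions side by side, the arithmetic is just the difference between the reported before and after scores. The dictionary layout below is only a convenience for the calculation.

```python
# (model size, benchmark) -> (base pass@1, pass@1 after CLI-Gym training), as reported.
scores = {
    ("32B", "Terminal-Bench 1.0"): (10.3, 38.9),
    ("235B-A22B", "Terminal-Bench 1.0"): (25.0, 46.1),
    ("32B", "Terminal-Bench 2.0"): (5.7, 19.5),
    ("235B-A22B", "Terminal-Bench 2.0"): (18.1, 31.0),
}

gains = {key: round(after - before, 1) for key, (before, after) in scores.items()}
for (size, bench), gain in gains.items():
    print(f"{size} on {bench}: +{gain}")
```

The absolute gains on Terminal-Bench 2.0 (+13.8 and +12.9) are smaller than on 1.0, but the base scores there are lower too.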

These are not tiny gains. More importantly, they come from only 291 filtered environment-repair trajectories, after a first-stage training pass on 48K open-source software engineering trajectories. That detail matters because the paper is not claiming that 291 examples magically teach all agentic software work. The result is more specific: once the model has basic agentic coding ability, targeted environment-repair supervision can add a missing skill.

A reasonable reading is:

| What the paper directly shows | Business meaning | Boundary |
| --- | --- | --- |
| CLI-Gym trajectories improve Qwen3-based models on Terminal-Bench | Targeted operational failure data can matter more than generic code examples for CLI agents | The evidence is benchmark-based and uses OpenHands, not a production DevOps deployment |
| Filtered trajectories outperform lower-quality ones after agentic pretraining | Data quality becomes more valuable once the model already knows basic tool use | Filtering depends on design choices such as minimum step count and shortcut detection |
| Environment diversity improves performance under a fixed data budget | More varied internal systems may teach agents more than repeatedly sampling one repository | The paper’s diversity comes from 29 selected Python repositories |
| Larger generated datasets can be compactly stored through shared base images | Synthetic environment task generation may be operationally feasible | Token generation cost remains large: the paper reports 2.3B tokens |

The result should not be flattened into “small data beats big models.” That would be too cute and not quite true. The better interpretation is: the right kind of agentic supervision can unlock capabilities that parameter scaling alone does not reliably provide.

The ablations explain what kind of data matters

The paper’s ablations are useful because they separate several mechanisms that could otherwise be confused.

First, SWE-style agentic trajectories help. They improve the models’ general tool use, navigation, and software repair behavior. This is the foundation.

Second, CLI-Gym trajectories help even more on terminal tasks. That makes intuitive sense. Environment repair requires different habits: inspecting runtime state, changing configuration, using shell feedback, recovering from failed attempts, and not getting hypnotized by the same error message.

Third, the combination works best. Generic software-engineering trajectories and specialized environment-repair trajectories are complementary. This is the least surprising result in the paper, but also the most practical. A firm building internal coding agents should not replace general software training with environment tasks. It should add environment tasks where its actual deployment failures occur.

The diversity ablation is more strategically interesting. The authors vary the number of source repositories while holding the total number of trajectories fixed at 100. Performance improves as more repositories are included. That suggests the model benefits from structurally different environments, not merely from seeing more examples.
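The design of such an ablation can be sketched as a sampling problem: hold the budget at 100 trajectories and change only how many repositories they are drawn from. The function below is an illustrative assumption about how one might build those subsets, with a synthetic pool and made-up repository names. It is not the authors' sampling procedure.

```python
import random
from collections import defaultdict
from typing import List, Tuple

Trajectory = Tuple[str, str]  # (repository name, trajectory id)


def sample_fixed_budget(pool: List[Trajectory], n_repos: int,
                        budget: int = 100, seed: int = 0) -> List[Trajectory]:
    """Pick `budget` trajectories spread as evenly as possible over `n_repos` repositories."""
    rng = random.Random(seed)
    by_repo = defaultdict(list)
    for repo, traj_id in pool:
        by_repo[repo].append(traj_id)

    repos = rng.sample(sorted(by_repo), n_repos)
    for repo in repos:
        rng.shuffle(by_repo[repo])

    picked: List[Trajectory] = []
    depth = 0
    while len(picked) < budget:
        added = False
        for repo in repos:  # round-robin keeps the split balanced
            if depth < len(by_repo[repo]) and len(picked) < budget:
                picked.append((repo, by_repo[repo][depth]))
                added = True
        if not added:
            raise ValueError("not enough trajectories in the chosen repositories")
        depth += 1
    return picked


# Synthetic pool: 8 hypothetical repositories with 100 trajectories each.
pool = [(f"repo{r}", f"traj{r}-{i}") for r in range(8) for i in range(100)]
sizes = {}
for k in (1, 2, 4, 8):
    subset = sample_fixed_budget(pool, n_repos=k)
    sizes[k] = (len(subset), len({repo for repo, _ in subset}))
print(sizes)
```

Every subset has the same size, so any performance difference between them is attributable to repository diversity, not to sample count.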

This is the part business readers should not skip.

If your agent is trained only on one internal stack, it may become very good at that stack’s favorite disasters and fragile elsewhere. If it sees multiple repositories, package managers, configuration patterns, and failure modes, it learns a broader diagnostic routine. In ordinary terms: the agent becomes less of a memorized runbook and more of a technician.

The trajectory-scaling result points in the same direction. Performance improves as more filtered trajectories are added, but the gains plateau beyond roughly 200 trajectories. The paper does not prove that 200 is a universal threshold. It almost certainly is not. But it does suggest that, in this setting, quality and diversity start to dominate brute sample count earlier than one might expect.

The hint experiment is about yield, not secret answers

One appendix result is easy to misread.

The paper studies hints in generated repair issues. Adding hints increases the number of usable trajectories: the full hint-assisted setup yields 291 trajectories, compared with 104 without hints. But when the hinted data is subsampled to the same scale as the no-hint setting, performance is similar: 23.0 without hints versus 22.8 with hints at 104 trajectories, while the full hinted set reaches 32.4.

The likely purpose of this test is not to show that hints make agents smarter by feeding them answers. It shows that hints improve data production yield. More generated tasks become solvable by rollout agents, creating more successful trajectories for training.

That distinction matters for business adaptation. In an internal enterprise version of CLI-Gym, hints may be useful during synthetic task generation even if the final deployed agent should not receive them. The goal is to collect valid repair demonstrations, not to make the evaluation artificially easy.

This is a recurring theme in agent training: scaffolding during data creation can be legitimate if it improves the quality of demonstrations and is removed or controlled during evaluation. The scaffolding is a ladder. Do not confuse it with the building.

The behavior change may matter more than the pass rate

The paper reports that, as filtered CLI-Gym trajectories increase, the proportion of failure cases where the agent gets stuck in repetitive loops drops sharply, from 42.7% to 3.0%.

That is one of the most business-relevant results in the paper.

A coding agent that fails cleanly is annoying. A coding agent that loops expensively is operationally dangerous. It burns tokens, time, and trust. Worse, it can produce the illusion of diligence: the logs are long, commands are being run, files are being inspected, and yet nothing is moving.

The paper’s interpretation is that environment-repair supervision improves long-horizon control. That is plausible. CLI tasks require agents to observe feedback, form hypotheses, test interventions, recover from wrong turns, and stop repeating unproductive actions. Those are not just coding skills. They are control skills.
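What counts as a "repetitive loop" needs an operational definition before it can be measured or policed. One possible definition, sketched below, flags a trajectory when the same command keeps producing the same output within a recent window. The window size, the repeat threshold, and the `(command, output)` representation are all assumptions. The paper's own criterion is not given here.

```python
from collections import Counter
from typing import List, Tuple

Step = Tuple[str, str]  # (command, observed output)


def is_stuck(steps: List[Step], window: int = 6, max_repeats: int = 3) -> bool:
    """Flag a trajectory whose recent steps repeat an identical command/output pair."""
    recent = [(cmd.strip(), out.strip()) for cmd, out in steps[-window:]]
    if not recent:
        return False
    most_common_count = Counter(recent).most_common(1)[0][1]
    return most_common_count >= max_repeats


looping = [("pip install foo", "ERROR: no matching distribution")] * 4
progressing = [
    ("pytest", "3 failed"),
    ("pip install bar", "ok"),
    ("pytest", "1 failed"),
    ("export X=1", ""),
    ("pytest", "0 failed"),
]
print(is_stuck(looping), is_stuck(progressing))
```

Comparing outputs as well as commands matters: rerunning the test suite after each attempted fix is healthy behavior, while rerunning it and seeing the identical failure three times is not.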

For enterprise deployment, this matters because the cost of agent failure is not only the unsolved task. It is the surrounding drag: CI minutes, developer review time, cloud spend, security exposure, and the cognitive exhaustion of reading an agent’s 300-line diary of confusion.

If CLI-Gym-style training reduces loop behavior, its value is not merely higher benchmark accuracy. It is cheaper failure.

What this means for businesses using coding agents

The direct business translation is not “fine-tune your own LiberCoder tomorrow.” Most firms should not start there. The practical lesson is about evaluation and data generation.

A company deploying coding agents can adapt the CLI-Gym idea internally:

1. choose Dockerized internal repositories or service templates;
2. define unit or integration tests that represent healthy behavior;
3. generate controlled environment degradations;
4. package the broken states as repair tasks;
5. evaluate whether agents can diagnose and recover without privileged shortcuts;
6. collect successful trajectories for future training, prompting, or runbook design.

This creates an internal benchmark that reflects the firm’s own operational reality. That is more useful than asking a general coding agent to perform beautifully on public examples and then discovering it cannot survive the company’s build system.

The ROI pathway is diagnostic before it is generative. CLI-Gym-style tasks can reveal where agents fail:

| Failure class | What the firm learns | Possible operational use |
| --- | --- | --- |
| Dependency repair failures | The agent cannot reason across package versions or binary compatibility | Improve environment inspection prompts or add package-management tools |
| Configuration failures | The agent misses path, variable, or service-level assumptions | Build better deployment context summaries |
| Filesystem and permission failures | The agent lacks Linux operational fluency | Restrict or expand tool permissions based on observed behavior |
| Repetitive loops | The agent lacks stopping and hypothesis-updating discipline | Add loop detectors, budget policies, and intervention triggers |
| Shortcut exploitation | The benchmark is leaking hidden artifacts | Harden evaluation design before trusting scores |

This is where the paper becomes useful beyond model training. Even without fine-tuning, firms can use environment inversion to stress-test coding agents before giving them access to real repositories. The lesson is very old-fashioned: break the system in controlled ways before the world breaks it in expensive ways.

The boundaries are real, but they do not kill the idea

The paper’s limitations should be read precisely, not sprinkled like decorative sea salt over every paragraph.

First, the tasks are synthetic. They are generated by agentic degradation, not collected from actual enterprise incidents. Synthetic does not mean useless. It means the generated failures must be checked for realism, shortcut resistance, and relevance to downstream operations.

Second, the repositories are Python-focused and selected from 29 open-source projects. This provides useful diversity, but it is not the same as covering Java microservices, Kubernetes clusters, cloud IAM failures, data pipelines, mobile build systems, or legacy enterprise software held together by shell scripts and institutional guilt.

Third, the evaluation uses OpenHands. The appendix notes that agent framework choice matters: on Terminal-Bench 2.0, some models perform differently under Terminus 2, Claude Code, or OpenHands. Therefore, the reported numbers are evidence about a model-plus-agent-framework setup, not a universal model property.

Fourth, generation is not free. The authors report 2.3B tokens to produce the 1,655 tasks. For a research lab or model company, that may be acceptable. For a smaller software firm, the immediate use case may be a narrower internal benchmark rather than full-scale model training.
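As a back-of-envelope figure, and assuming the 2.3B tokens are simply averaged over the final tasks (the total presumably also covers attempts that were discarded along the way), the cost per retained task is roughly:

$$ \frac{2.3 \times 10^{9}\ \text{tokens}}{1{,}655\ \text{tasks}} \approx 1.39 \times 10^{6}\ \text{tokens per task} $$

That is on the order of a million tokens for every task that survives, which is why a narrow internal benchmark is the more realistic starting point for a small team.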

Fifth, the paper’s strongest claim is transfer to Terminal-Bench, not demonstrated production reliability. The results show that CLI-Gym data improves benchmark performance and model behavior under controlled evaluation. Cognaptus infers business relevance from the similarity between benchmark task types and real operational failures. That inference is reasonable, but it remains an inference.

The strategic point: environment data is becoming infrastructure

The paper’s deeper message is that agentic AI progress depends on the worlds we build for agents to practice in.

For code editing, the world is already unusually well-instrumented: commits, issues, tests, pull requests. For environment repair, the world is messier. CLI-Gym’s contribution is to make that mess generatable. It turns healthy Dockerized repositories into controlled failure factories. Slightly villainous, yes. Also useful.

The lesson for AI teams is not merely “collect more data.” It is: collect the right failure states, preserve their causal structure, verify them with tests, remove shortcuts, and diversify environments.

The lesson for business leaders is even more direct. If an AI agent will operate inside your development or deployment environment, you need to know whether it can repair that environment, not just write code inside it. A model that can produce a beautiful patch but cannot diagnose a broken runtime is not an autonomous engineer. It is a very eloquent passenger.

CLI-Gym does not solve agentic software engineering. It solves a narrower and more interesting problem: how to manufacture the missing training ground for environment repair. That is enough.

Sometimes the fastest way to build better agents is to break things first.

Carefully. Reproducibly. And preferably inside Docker, where the chaos cannot develop a taste for production.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yusong Lin, Haiyang Wang, Shuzhe Wu, Lue Fan, Feiyang Pan, Sanyuan Zhao, and Dandan Tu, “CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion,” arXiv:2602.10999, 2026. https://arxiv.org/abs/2602.10999
