Opening — Why This Matters Now
For the past two years, “AI coding agents” have been quietly conquering GitHub pull requests. Resolution rates on benchmarks like SWE-Bench climbed past 70%. Investors applauded. Model sizes ballooned. Everyone nodded approvingly.
Then the models walked into a terminal.
On Terminal-Bench, where agents must actually interact with Linux environments—resolving dependencies, fixing broken libraries, debugging system configurations—even 100B+ parameter models struggle to reach 40% success. The gap is not incremental. It’s structural.
The uncomfortable truth: writing code is not the same as repairing an environment.
The paper “CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion” (arXiv:2602.10999v1) introduces what may be the first scalable answer to this imbalance. And it does so with a rather elegant idea: if you want agents to learn how to fix broken systems, you should first teach them how to break them.
Yes. Deliberately.
Background — The Asymmetry in Agent Training
The modern coding agent ecosystem is heavily optimized for code-intensive tasks.
Repositories track commit histories. Pull requests document failures. Unit tests provide verification. Version control gives you clean “before” and “after” states.
That infrastructure makes task generation straightforward, as the sketch after this list shows:
- Revert a commit.
- Capture the buggy state.
- Package the PR context.
- Create a benchmark instance.
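Here is a minimal sketch of that recipe, assuming a local git checkout, a pytest-based test suite, and a hypothetical `CodeTask` schema of my own; none of this is from the paper, it just makes the workflow concrete:

```python
import subprocess
from dataclasses import dataclass, field

@dataclass
class CodeTask:
    """One SWE-Bench-style instance mined from a bug-fix commit (hypothetical schema)."""
    repo: str
    buggy_ref: str                       # commit where the bug is still present
    gold_ref: str                        # commit that contains the fix
    failing_tests: list = field(default_factory=list)

def collect_failing_tests(repo_path: str) -> list:
    """Run the repo's test suite and return the ids of failing tests (pytest assumed)."""
    proc = subprocess.run(["python", "-m", "pytest", "-q", "--tb=no", repo_path],
                          capture_output=True, text=True)
    return [line.split()[1] for line in proc.stdout.splitlines()
            if line.startswith("FAILED")]

def task_from_fix_commit(repo_path: str, fix_commit: str) -> CodeTask:
    """Check out the parent of a fix commit to recover the buggy 'before' state."""
    subprocess.run(["git", "-C", repo_path, "checkout", f"{fix_commit}^"], check=True)
    return CodeTask(repo=repo_path,
                    buggy_ref=f"{fix_commit}^",
                    gold_ref=fix_commit,
                    failing_tests=collect_failing_tests(repo_path))
```

Every ingredient here (the commit graph, the PR context, the test suite) already lives in the repository.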
Environment-intensive tasks do not enjoy this luxury.
Your Docker container does not maintain a rich, annotated history of misconfigurations. There is no neatly logged “oops, corrupted shared library at 14:03.” And runtime environments vary across developers.
The result? Terminal-Bench 1.0 and 2.0 together contain fewer than 200 human-written tasks.
Meanwhile, code-centric datasets scale to tens of thousands.
The bottleneck is not model capacity. It is environment data scarcity.
CLI-Gym addresses precisely that.
Analysis — Turning Task Creation into an Agentic Problem
1. Reframing Agentic Coding as State Transition
The authors formalize a coding environment as:
$$ S = (B, D, C) $$
Where:
- $B$ = base image
- $D$ = Dockerfile (environment configuration)
- $C$ = codebase
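In code, that state tuple might look like the following minimal sketch (the field names are mine, not the paper's):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EnvState:
    """S = (B, D, C): a fully specified, reproducible coding environment."""
    base_image: str     # B: e.g. "python:3.11-slim"
    dockerfile: str     # D: the environment configuration layered on top of B
    codebase: str       # C: path or git ref to the project source
```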
Conventional agentic coding aims to transform:
$$ S_{poor} \rightarrow S_{gold} $$
CLI-Gym inverts this.
Instead of fixing environments, it asks an agent to degrade a working environment into a broken one:
$$ (S_{gold}, T_{passed}) \rightarrow (S_{poor}, T_{failed}) $$
This is the core conceptual move: treat task generation itself as an agentic process.
Rather than manually crafting failures, an agent explores a “history space” of possible environment corruptions.
It executes commands. It perturbs Dockerfiles. It corrupts dependencies. It induces unit test failures.
And once failure occurs, that broken state becomes a reproducible CLI task instance.
Scalable. Automated. No human labeling required.
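Here is a minimal sketch of that inversion loop. The LLM agent's policy is replaced by a random pick over a tiny hand-written action list purely for illustration, and the image names, perturbation commands, and the assumption that the gold image bundles its codebase and pytest suite are all mine:

```python
import random
import subprocess
from dataclasses import dataclass

@dataclass
class BrokenTask:
    perturbation: str      # the command that degraded the environment
    failing_tests: list    # unit tests that now fail in the broken state

# Toy action space; the real agent searches a far richer "history space" of corruptions.
PERTURBATIONS = [
    "pip uninstall -y requests",                                                     # drop a dependency
    "head -c4 /dev/zero | dd of=/usr/lib/x86_64-linux-gnu/libz.so.1 conv=notrunc",   # clobber an ELF header
    "rm -f /etc/ssl/certs/ca-certificates.crt",                                      # break TLS verification
]

def failing_tests_in(image: str) -> list:
    """Run pytest inside `image` (assumed to bundle the codebase) and return failing test ids."""
    proc = subprocess.run(
        ["docker", "run", "--rm", image, "python", "-m", "pytest", "-q", "--tb=no"],
        capture_output=True, text=True)
    return [l.split()[1] for l in proc.stdout.splitlines() if l.startswith("FAILED")]

def invert(gold_image: str, max_tries: int = 10) -> BrokenTask | None:
    """Degrade a verified-passing environment until its unit tests fail."""
    for _ in range(max_tries):
        cmd = random.choice(PERTURBATIONS)
        # Apply the perturbation, then commit the container so the broken state is reproducible.
        subprocess.run(["docker", "run", "--name", "cli-gym-candidate",
                        gold_image, "bash", "-c", cmd])            # no check=True: breakage is the goal
        subprocess.run(["docker", "commit", "cli-gym-candidate", "cli-gym/broken:latest"], check=True)
        subprocess.run(["docker", "rm", "cli-gym-candidate"], check=True)
        failures = failing_tests_in("cli-gym/broken:latest")
        if failures:                       # failure induced -> a usable CLI repair task
            return BrokenTask(perturbation=cmd, failing_tests=failures)
    return None
```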
2. The Pipeline: From Gold Environment to Repair Task
The CLI-Gym workflow consists of three stages:
| Stage | Input | Process | Output |
|---|---|---|---|
| Gold Construction | GitHub repo | Build Docker image, verify unit tests pass | Runnable environment |
| Environment Inversion | Gold state | Agent induces failures via Dockerfile actions | Broken environment |
| Task Generation | Failing unit tests | LLM synthesizes issue description | CLI repair task |
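The third stage is mostly packaging plus a grading rule. A rough sketch of what a single generated instance might carry, with a schema and success criterion that are my reading of the table above rather than the paper's exact format:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CLITask:
    """One CLI repair task produced by the pipeline (hypothetical schema)."""
    gold_image: str        # verified-passing environment, kept for reference and grading
    broken_image: str      # degraded environment the agent starts from
    issue: str             # natural-language symptom report synthesized by an LLM
    failing_tests: list    # unit tests the agent must make pass again

def grade(repaired_image: str, task: CLITask,
          run_failing_tests: Callable[[str], list]) -> bool:
    """A repair attempt succeeds iff none of the originally failing tests still fail."""
    return not (set(run_failing_tests(repaired_image)) & set(task.failing_tests))
```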
Notably, degradation is not superficial.
One example (shown in the paper) corrupts ELF headers of shared libraries (libsqlite3, libz) to trigger system-level failures. This is no trivial syntax error. It forces agents to diagnose dynamic linking issues.
This pushes models beyond “edit code” toward genuine systems reasoning.
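For intuition, corrupting an ELF header can be as small as overwriting the four magic bytes at the start of the shared object. A destructive little sketch (the mechanism and library path are my illustration, to be run only inside a disposable container):

```python
def corrupt_elf_header(path: str) -> None:
    """Overwrite the ELF magic bytes (0x7f 'E' 'L' 'F') so the dynamic loader rejects the file."""
    with open(path, "r+b") as f:
        f.write(b"\x00\x00\x00\x00")   # clobber the first four bytes in place

# e.g. corrupt_elf_header("/usr/lib/x86_64-linux-gnu/libz.so.1")
# Any dynamically linked program that needs libz then fails at load time, and the repair
# agent has to trace an opaque loader error back to a single damaged shared object.
```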
Findings — Scale Changes Behavior
CLI-Gym generated:
- 1,655 environment-intensive tasks
- From 29 open-source repositories
- With 291 high-quality repair trajectories curated for training
For context:
| Dataset | # Tasks | Human Labor | Focus |
|---|---|---|---|
| Terminal-Bench | 80–89 | 93 contributors | CLI repair |
| CLI-Gym | 1,655 | Fully automated | CLI repair |
The training impact is striking.
Terminal-Bench 1.0 Results (Pass@1)
| Model | Base | After CLI-Gym | Improvement (pp) |
|---|---|---|---|
| Qwen3-32B | 10.3% | 38.9% | +28.6 |
| Qwen3-235B-A22B | 25.0% | 46.1% | +21.1 |
With only 291 trajectories.
Even more interesting: 32B models trained with CLI-Gym outperform much larger open-weight models (including 480B variants).
Scaling parameters is helpful. Scaling environment diversity is transformative.
Data Diversity > Data Volume
Two ablations reveal something subtle but important:
- Increasing repository diversity improves performance even when total trajectory count is fixed.
- Beyond ~200 trajectories, performance gains plateau.
In short:
Environment diversity matters more than raw sample size.
This mirrors a broader pattern in AI training: exposure to structurally different failure modes builds generalizable competence.
Behavioral Shift: Fewer “Stuck in Loop” Failures
The paper reports a sharp drop in repetitive loop failures (from ~42% to ~3%) as CLI-Gym supervision increases.
This suggests something deeper than accuracy improvement.
Environment repair training appears to strengthen:
- Long-horizon planning
- Error recovery
- Exploration stability
- Context retention under CLI feedback
In other words, it improves agent control dynamics, not just correctness.
Implications — Why This Matters for Real-World AI Systems
1. DevOps and Production AI
Most enterprise AI agents fail not at writing functions but at managing environments:
- Dependency conflicts
- Container misconfigurations
- Runtime path issues
- Corrupted shared libraries
- Security permission mismatches
CLI-Gym directly targets this failure class.
If you are deploying AI agents into DevOps pipelines, this is not academic—it is operational risk reduction.
2. Model Scaling vs Data Scaling
The leaderboard comparison reveals a strategic insight:
| Strategy | Outcome |
|---|---|
| Increase model size | Incremental gains |
| Increase environment training data | Structural gains |
For businesses deciding between compute scaling and data engineering investment, CLI-Gym argues strongly for the latter.
The ROI profile is compelling:
- 2.3B token cost
- Fully automated pipeline
- A roughly 20x scale increase over human-curated task sets
That is infrastructure leverage.
3. The Future of Agent Training
CLI-Gym suggests a generalizable paradigm:
To train agents for real-world complexity, invert the world.
Instead of asking:
“How do we collect more failure examples?”
Ask:
“How can agents generate their own failure spaces?”
This transforms dataset creation from a bottleneck into a self-improving loop.
It also hints at a future where:
- Agents continuously stress-test environments
- Synthetic failure corpora evolve
- Training becomes environment-adaptive
We may eventually see “Environment Gym” ecosystems across:
- Cloud orchestration
- Cybersecurity
- Financial infrastructure
- Robotics control systems
CLI-Gym is likely the first public step in that direction.
Conclusion — Breaking to Build Better
The leap from code-writing AI to environment-repair AI is not cosmetic. It is architectural.
CLI-Gym reframes the problem: rather than waiting for humans to label failures, let agents generate structured degradation histories themselves.
The result is not just higher benchmark scores. It is more stable, less loop-prone, more adaptable agents.
And perhaps most importantly—it demonstrates that scalable progress in agentic AI may depend more on how we generate environments than how we scale parameters.
Sometimes the fastest way forward is to break things—systematically, reproducibly, and at scale.
Cognaptus: Automate the Present, Incubate the Future.