Opening — Why This Matters Now

For the past two years, “AI coding agents” have been quietly conquering GitHub pull requests. Benchmarks like SWE-Bench climbed past 70% resolution rates. Investors applauded. Model sizes ballooned. Everyone nodded approvingly.

Then the models walked into a terminal.

On Terminal-Bench, where agents must actually interact with Linux environments—resolving dependencies, fixing broken libraries, debugging system configurations—even 100B+ parameter models struggle to reach 40% success. The gap is not incremental. It’s structural.

The uncomfortable truth: writing code is not the same as repairing an environment.

The paper “CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion” (arXiv:2602.10999v1) introduces what may be the first scalable answer to this imbalance. And it does so with a rather elegant idea: if you want agents to learn how to fix broken systems, you should first teach them how to break them.

Yes. Deliberately.


Background — The Asymmetry in Agent Training

The modern coding agent ecosystem is heavily optimized for code-intensive tasks.

Repositories track commit histories. Pull requests document failures. Unit tests provide verification. Version control gives you clean “before” and “after” states.

That infrastructure makes task generation straightforward:

  1. Revert a commit.
  2. Capture the buggy state.
  3. Package the PR context.
  4. Create a benchmark instance.

Environment-intensive tasks do not enjoy this luxury.

Your Docker container does not maintain a rich, annotated history of misconfigurations. There is no neatly logged “oops, corrupted shared library at 14:03.” And runtime environments vary across developers.

The result? Terminal-Bench 1.0 and 2.0 together contain fewer than 200 human-written tasks.

Meanwhile, code-centric datasets scale to tens of thousands.

The bottleneck is not model capacity. It is environment data scarcity.

CLI-Gym addresses precisely that.


Analysis — Turning Task Creation into an Agentic Problem

1. Reframing Agentic Coding as State Transition

The authors formalize a coding environment as:

$$ S = (B, D, C) $$

Where:

  • $B$ = base image
  • $D$ = Dockerfile (environment configuration)
  • $C$ = codebase

Conventional agentic coding aims to transform:

$$ S_{poor} \rightarrow S_{gold} $$

CLI-Gym inverts this.

Instead of fixing environments, it asks an agent to degrade a working environment into a broken one:

$$ (S_{gold}, T_{passed}) \rightarrow (S_{poor}, T_{failed}) $$

This is the core conceptual move: treat task generation itself as an agentic process.

Rather than manually crafting failures, an agent explores a “history space” of possible environment corruptions.

It executes commands. It perturbs Dockerfiles. It corrupts dependencies. It induces unit test failures.

And once failure occurs, that broken state becomes a reproducible CLI task instance.

Scalable. Automated. No human labeling required.
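To make the inversion concrete, here is a minimal Python sketch of the loop, written against the paper's $(B, D, C)$ formalization. The helpers `propose_action`, `apply_action`, and `run_unit_tests` are hypothetical stand-ins for the agent's policy and the Docker-based test harness, not the authors' actual API.

```python
# Minimal sketch of agentic environment inversion; helper functions are hypothetical.
from dataclasses import dataclass

@dataclass
class EnvState:
    base_image: str      # B: base image
    dockerfile: str      # D: environment configuration
    codebase_path: str   # C: codebase

def invert(gold, propose_action, apply_action, run_unit_tests, max_steps=20):
    """Degrade a verified gold environment until its unit tests fail."""
    state, history = gold, []
    for _ in range(max_steps):
        action = propose_action(state, history)   # agent picks a corruption, e.g. break a dependency
        state = apply_action(state, action)        # execute it (shell command / Dockerfile edit)
        history.append(action)
        if not run_unit_tests(state):              # failure reached: this state becomes a repair task
            return state, history
    return None, history                           # budget exhausted without breaking anything
```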


2. The Pipeline: From Gold Environment to Repair Task

The CLI-Gym workflow consists of three stages:

| Stage | Input | Process | Output |
|---|---|---|---|
| Gold Construction | GitHub repo | Build Docker image, verify unit tests pass | Runnable environment |
| Environment Inversion | Gold state | Agent induces failures via Dockerfile actions | Broken environment |
| Task Generation | Failed unit tests | LLM synthesizes issue description | CLI repair task |
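Stitched together, the three stages look roughly like the sketch below. All helpers (`build_gold`, `invert`, `collect_failing_tests`, `synthesize_issue`) and the output fields are illustrative assumptions, not the paper's actual schema.

```python
# End-to-end sketch of the three-stage pipeline; not the authors' code.
def generate_task(repo_url, build_gold, invert, collect_failing_tests, synthesize_issue):
    gold = build_gold(repo_url)                 # Stage 1: build image, verify unit tests pass
    broken, history = invert(gold)              # Stage 2: degrade the gold environment
                                                # (e.g., the inversion loop sketched earlier)
    if broken is None:                          # inversion budget exhausted without causing failure
        return None
    failing = collect_failing_tests(broken)     # failed tests become the task's success criteria
    issue = synthesize_issue(broken, failing)   # Stage 3: LLM writes a natural-language issue report
    return {
        "repo": repo_url,
        "broken_dockerfile": broken.dockerfile,
        "failing_tests": failing,
        "issue_description": issue,
        "degradation_history": history,         # ground truth of how the environment was broken
    }
```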

Notably, degradation is not superficial.

One example (shown in the paper) corrupts ELF headers of shared libraries (libsqlite3, libz) to trigger system-level failures. This is no trivial syntax error. It forces agents to diagnose dynamic linking issues.

This pushes models beyond “edit code” toward genuine systems reasoning.
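For a flavor of what such a degradation action might look like, here is an illustrative snippet (not from the paper) that clobbers the ELF magic bytes of a copy of a shared library. The library path is an assumption and varies by distribution.

```python
# Illustrative only: corrupt the ELF header of a *copy* of a shared library.
import shutil

src = "/usr/lib/x86_64-linux-gnu/libz.so.1"   # assumed location of zlib's shared object
dst = "/tmp/libz.so.1.corrupted"

shutil.copy(src, dst)

with open(dst, "r+b") as f:
    f.seek(0)
    f.write(b"\x00\x00\x00\x00")  # overwrite the ELF magic (0x7f 'E' 'L' 'F')

# Loading this copy (e.g., via ctypes.CDLL(dst)) now typically fails with a
# loader error such as "invalid ELF header", so a repair agent must reason
# about dynamic linking rather than edit source code.
```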


Findings — Scale Changes Behavior

CLI-Gym generated:

  • 1,655 environment-intensive tasks
  • From 29 open-source repositories
  • With 291 high-quality repair trajectories curated for training

For context:

| Dataset | # Tasks | Human Labor | Focus |
|---|---|---|---|
| Terminal-Bench | 80–89 | 93 contributors | CLI repair |
| CLI-Gym | 1,655 | Fully automated | CLI repair |

The training impact is striking.

Terminal-Bench 1.0 Results (Pass@1)

| Model | Base | After CLI-Gym | Improvement (pts) |
|---|---|---|---|
| Qwen3-32B | 10.3% | 38.9% | +28.6 |
| Qwen3-235B-A22B | 25.0% | 46.1% | +21.1 |

With only 291 trajectories.

Even more interesting: 32B models trained with CLI-Gym outperform much larger open-weight models (including 480B variants).

Scaling parameters is helpful. Scaling environment diversity is transformative.


Data Diversity > Data Volume

Two ablations reveal something subtle but important:

  1. Increasing repository diversity improves performance even when total trajectory count is fixed.
  2. Beyond ~200 trajectories, performance gains plateau.

In short:

Environment diversity matters more than raw sample size.

This mirrors a broader pattern in AI training: exposure to structurally different failure modes builds generalizable competence.


Behavioral Shift: Fewer “Stuck in Loop” Failures

The paper reports a sharp drop in repetitive loop failures (from ~42% to ~3%) as CLI-Gym supervision increases.

This suggests something deeper than accuracy improvement.

Environment repair training appears to strengthen:

  • Long-horizon planning
  • Error recovery
  • Exploration stability
  • Context retention under CLI feedback

In other words, it improves agent control dynamics, not just correctness.


Implications — Why This Matters for Real-World AI Systems

1. DevOps and Production AI

Most enterprise AI agents fail not at writing functions but at managing environments:

  • Dependency conflicts
  • Container misconfigurations
  • Runtime path issues
  • Corrupted shared libraries
  • Security permission mismatches

CLI-Gym directly targets this failure class.

If you are deploying AI agents into DevOps pipelines, this is not academic—it is operational risk reduction.


2. Model Scaling vs Data Scaling

The leaderboard comparison reveals a strategic insight:

| Strategy | Outcome |
|---|---|
| Increase model size | Incremental gains |
| Increase environment training data | Structural gains |

For businesses deciding between compute scaling and data engineering investment, CLI-Gym argues strongly for the latter.

The ROI profile is compelling:

  • 2.3B token cost
  • Fully automated pipeline
  • 20x scale increase over human-written task sets

That is infrastructure leverage.


3. The Future of Agent Training

CLI-Gym suggests a generalizable paradigm:

To train agents for real-world complexity, invert the world.

Instead of asking:

“How do we collect more failure examples?”

Ask:

“How can agents generate their own failure spaces?”

This transforms dataset creation from a bottleneck into a self-improving loop.

It also hints at a future where:

  • Agents continuously stress-test environments
  • Synthetic failure corpora evolve
  • Training becomes environment-adaptive

We may eventually see “Environment Gym” ecosystems across:

  • Cloud orchestration
  • Cybersecurity
  • Financial infrastructure
  • Robotics control systems

CLI-Gym is likely the first public step in that direction.


Conclusion — Breaking to Build Better

The leap from code-writing AI to environment-repair AI is not cosmetic. It is architectural.

CLI-Gym reframes the problem: rather than waiting for humans to label failures, let agents generate structured degradation histories themselves.

The result is not just higher benchmark scores. It is more stable, less loop-prone, more adaptable agents.

And perhaps most importantly—it demonstrates that scalable progress in agentic AI may depend more on how we generate environments than how we scale parameters.

Sometimes the fastest way forward is to break things—systematically, reproducibly, and at scale.

Cognaptus: Automate the Present, Incubate the Future.