Breaking Things on Purpose: How CLI-Gym Teaches AI to Fix the Real World

Opening: Why This Matters Now

For the past two years, “AI coding agents” have been quietly conquering GitHub pull requests. Benchmarks like SWE-Bench climbed past 70% resolution rates. Investors applauded. Model sizes ballooned. Everyone nodded approvingly.

Then the models walked into a terminal.

On Terminal-Bench, where agents must actually interact with Linux environments (resolving dependencies, fixing broken libraries, debugging system configurations), even 100B+ parameter models struggle to reach 40% success. The gap is not incremental. It’s structural. ...