In the race to build autonomous software engineers, large language model (LLM) agents like Devin and Copilot Chat are lauded for fixing bugs, writing code, and even completing tasks from GitHub issues. But what happens when the code doesn’t even run? That’s the uncomfortable gap SetupBench aims to measure—and the results are sobering.

SetupBench is a 93-task benchmark that evaluates a foundational but under-tested skill: bootstrapping a development environment from scratch. Unlike prior benchmarks that hand agents a fully pre-configured Docker container, SetupBench drops them into a barebones Linux sandbox and challenges them to install dependencies, initialize databases, configure background services, and resolve real-world version conflicts. It sounds simple. It isn’t.

Setup is the Real Bottleneck

Modern benchmarks over-index on coding correctness but sidestep environment configuration. This misrepresents real-world workflows where setup failures—missing packages, broken migrations, service orchestration issues—are often what prevent code from being run, tested, or deployed. SetupBench spans seven programming ecosystems and five database engines. Tasks range from Python test environments using tox to Gunicorn services over Unix sockets.

Category                    Instances   Technologies Covered
Repo Setup                  54          Python, JavaScript, TypeScript, Java, Go, Rust, C++
Dependency Resolution       16          npm, Bundler, Poetry, pip
Database Configuration      15          MySQL, Postgres, SQLite, Redis, MongoDB
Background Service Setup    8           Gunicorn, Celery, NGINX, file-watchers, autossh
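
To make the largest category concrete: a typical Python repo-setup task reduces to getting the repository’s test runner working from a bare image. The sketch below is illustrative only, not an actual benchmark instance, and the repository path /workspace/repo is a placeholder:

    # Bare Debian/Ubuntu-style sandbox: install Python tooling first
    apt-get update && apt-get install -y python3 python3-pip python3-venv

    # Create an isolated environment and install the test runner the repo expects
    python3 -m venv /opt/venv
    /opt/venv/bin/pip install tox

    # Build the environments declared in tox.ini and run the suite
    cd /workspace/repo && /opt/venv/bin/tox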

Each task is validated by a single deterministic shell command like curl --unix-socket /tmp/gunicorn.sock http://localhost/ | grep -q "Hello", offering a clean pass/fail signal devoid of flaky test suites or subjective human ratings.
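
Working backwards from that check, a passing solution for the Gunicorn task would look roughly like the following; the WSGI module name (app.py), the response text, and the throwaway virtualenv are assumptions inferred from the validation command, not the benchmark’s actual files:

    # Hypothetical WSGI app whose response body contains "Hello"
    cat > app.py <<'EOF'
    def app(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"Hello from SetupBench"]
    EOF

    # Install Gunicorn and bind it to the Unix socket the validator expects
    python3 -m venv /opt/venv && /opt/venv/bin/pip install gunicorn
    /opt/venv/bin/gunicorn --bind unix:/tmp/gunicorn.sock --daemon app:app

    # The benchmark's check: passes only if the service answers over the socket
    curl --unix-socket /tmp/gunicorn.sock http://localhost/ | grep -q "Hello"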

Performance is Poor, Even for SOTA Models

The benchmark evaluated OpenHands, a state-of-the-art coding agent, driven by several backend models, including Claude and GPT-4 variants. Across all 93 tasks, the best-performing configuration (Claude 4) only succeeded 62.4% of the time. Success rates were even lower for complex workflows: just 46.7% for local database setup and 50.0% for background service orchestration.

Failure Modes

  1. Missing tooling: Agents skip installing test runners like tox despite clear indicators.
  2. Hallucinated constraints: Agents rewrite ports or config files based on phantom requirements.
  3. Non-persistent installs: Agents install tools with pip install --user or adjust PATH only in the current shell, so later sessions can no longer find them, breaking continuity (demonstrated below).
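
The persistence failure is easy to reproduce. The sketch below assumes pip’s default --user location (~/.local/bin); whether a later shell picks it up depends on the image’s profile scripts, which barebones containers often lack:

    # Shell session 1: the agent installs a tool for the current user only
    pip install --user tox                   # script lands in ~/.local/bin/tox
    export PATH="$HOME/.local/bin:$PATH"     # fixes PATH, but only for this shell
    tox --version                            # works here

    # Shell session 2 (a later, separate invocation): the export is gone
    tox --version                            # often fails: "tox: command not found"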

These aren’t one-off issues; they represent systemic limitations in LLM agents’ understanding of real-world workflows. They don’t just fail the test—they fail the collaboration.

Agents Are Also Inefficient

Success isn’t everything. Even when agents complete tasks, they do so inefficiently. Comparing their steps to optimal human behavior reveals 38% to 69% of actions are wasted. The main inefficiencies:

  • Redundant reads (head -40, then head -60, and so on; see the sketch after this list)
  • Checking for pre-installed packages despite being told the environment is clean
  • Exploring unrelated files instead of setup-relevant documentation
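
A concrete flavor of the waste, stylized rather than taken from a logged trajectory (README.md stands in for whatever file actually holds the setup instructions):

    # Wasteful pattern: several overlapping reads of the same file
    head -40 README.md
    head -60 README.md     # re-reads the first 40 lines plus 20 more
    head -100 README.md    # ...and again

    # One targeted read retrieves the same information in a single step
    sed -n '1,100p' README.md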

Claude 4, the top performer in accuracy, also used the most tokens and took the most steps. In contrast, Claude 3.5 achieved nearly the same success rate while being dramatically more efficient.

What Needs to Change

To bridge the gap between code-writing and system-readiness, future coding agents need more than better models. They need architectural changes:

  • Persistent setup protocols: Write environment changes to .bashrc or other config files and source them (see the sketch after this list).
  • Context-aware exploration: Rank files by likelihood of containing setup instructions; avoid wasteful exploration.
  • Efficient heuristics: Use file trees and semantic cues to mimic how humans navigate unfamiliar repos.
  • Constraint validation: Require citations or justification when changing environment variables or ports.
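
For the first of these, the fix is mechanical once adopted as a protocol. A minimal sketch, assuming a Bash environment that reads ~/.bashrc in later sessions; the DATABASE_URL value is a made-up example, not a benchmark requirement:

    # Persist changes for every future shell, not just the current one
    echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
    echo 'export DATABASE_URL="postgres://localhost:5432/app"' >> ~/.bashrc

    # Apply the same changes to the current session as well
    source ~/.bashrc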

And perhaps most importantly: benchmarks like SetupBench should become standard in agent evaluation. Until agents can reliably get real-world code running, their ability to write it counts for little in practice.

Toward End-to-End Developer Agents

SetupBench provides more than just a benchmark. It lays the groundwork for a new class of evaluations—ones that test LLM agents not in isolation, but in continuity. Future versions could expand to cloud provisioning (e.g., Terraform, Kubernetes), system migrations, or chained development tasks where setup is only the beginning.

For AI agents to integrate seamlessly into developer workflows, they must first master the environment. Because in real engineering, setup isn’t a preamble. It’s the first hurdle.


Cognaptus: Automate the Present, Incubate the Future