TL;DR for operators

Setup is where many AI coding-agent promises meet the concrete floor.

The SetupBench paper introduces a 93-task benchmark that asks software engineering agents to do something less glamorous than writing a clever patch: start from a bare Linux sandbox, install what is missing, resolve dependency conflicts, initialise databases, configure services, and prove the environment works through a deterministic validation command.[1]

The headline result is not catastrophic, but it is not reassuring either. Across OpenHands variants, overall success ranges from 34.4% to 62.4%. The strongest evaluated model, Claude 4 Sonnet inside OpenHands, clears 62.4% overall, but does so with far higher token and step usage than lighter alternatives. Repository setup lands between 38.9% and 57.4%; local database setup is worse, between 20.0% and 53.3%.

For buyers of AI developer tools, the question is no longer simply: “Can the agent write code?” It is: “Can the agent make the code runnable, preserve the environment state, and leave a human with a usable handoff?” That is a lower-level question. Naturally, it is also the one that decides whether the demo becomes a workflow.

Cognaptus’ practical reading is straightforward:

| What the paper directly shows | Business meaning | Boundary |
|---|---|---|
| SetupBench tests setup from bare environments, not pre-baked containers | Agent procurement should include first-mile setup tests, not just code-editing benchmarks | The benchmark is curated and still small at 93 tasks |
| OpenHands variants succeed only 34.4%–62.4% overall | Current agents remain unreliable for unattended end-to-end engineering work | Results are agent-framework- and model-specific |
| Failures include missing test tooling, hallucinated constraints, and non-persistent setup | Reliability depends on environment-state management, not just better prompting | The sandbox allowed root access and outbound networking |
| 38%–69% of analysed setup actions are wasted | Token cost, latency, and context loss are operational problems, not academic footnotes | Efficiency analysis is based on a 10-instance subset |

The sharp commercial lesson: coding agents are being sold as junior developers, but SetupBench evaluates whether they can behave like someone who has actually onboarded to a repository before. The difference is not subtle.

The evidence starts before the code

A familiar engineering ritual goes like this: clone the repository, read the README, install dependencies, discover that the README is lying by omission, inspect a config file, install the test runner, fix a version conflict, start a local database, and only then begin the “real” task.

Most software engineering benchmarks skip much of that ritual. They place agents inside carefully prepared containers where dependencies, services, and system packages are already present. This is convenient for evaluation. It is also a little like testing a chef by giving them a finished kitchen, a stocked pantry, and a sous-chef who quietly repaired the oven earlier.

SetupBench is designed to remove that hidden assistance. Each task begins in a fresh environment and ends only when a validation command prints “Setup successful”. The benchmark covers four task families: repository setup, dependency resolution, local database setup, and background-service setup. Its 93 instances span seven programming language ecosystems, five database engines, and several service orchestration scenarios.

That makes the paper evidence-first by nature. The contribution is not a new agent architecture, nor an elegant theory of software work. It is a measurement instrument aimed at a missing part of the agent evaluation stack: the first mile between “here is a repository” and “the thing runs.”

The distinction matters because many agent failures are not failures of programming syntax. They are failures of operational interpretation. The agent must infer that tox.ini means tox should exist. It must understand that a database migration may require a running service, credentials, or seeded data. It must know that a PATH update made in one shell might vanish when a validation harness runs in another.

That is not glamorous intelligence. It is maintenance intelligence. Unfortunately, businesses run on rather a lot of that.

SetupBench tests the part benchmarks usually pre-install

The benchmark’s design is intentionally practical. Each instance provides three things: a natural-language problem statement, a workspace snapshot, and a deterministic success command. The success command is not an LLM judge. It is a one-line validation command that returns a literal success or failure signal.
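The validation pattern described above can be sketched in a few lines. This is a minimal illustration, assuming the harness only needs to run one shell command in a new process and look for the literal success string; the function name `validate`, the timeout, and the use of `sh -c` are assumptions for illustration, not the paper's actual harness.

```python
import subprocess

# Literal marker the article says the validation command prints on success.
SUCCESS_MARKER = "Setup successful"


def validate(command: str, timeout_s: int = 300) -> bool:
    """Run a success command in a new shell process and check its stdout.

    A new process does not inherit exports or aliases that existed only in
    the agent's interactive shell, which is the point of the check.
    """
    try:
        result = subprocess.run(
            ["sh", "-c", command],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    # Pass or fail depends only on the literal marker, not on a model's opinion.
    return SUCCESS_MARKER in result.stdout
```

A hypothetical call might look like `validate('tox --version >/dev/null 2>&1 && echo "Setup successful" || echo "Setup failed"')`. The command itself is invented here; the design choice it illustrates is that success is defined outside the agent and checked after it finishes.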

The paper’s task families have different operational meanings:

| Task family | Instances | What the agent must demonstrate | Why it matters in production-like work |
|---|---|---|---|
| Repository setup | 54 | Read project documentation, install system and language dependencies, build or launch the project | This is the normal first step before bug fixing or feature work |
| Dependency resolution | 16 | Diagnose and repair real package-manager conflicts from npm and Bundler-style ecosystems | Version conflicts are common blockers in legacy and active projects |
| Local database setup | 15 | Install, configure, seed, migrate, or repair PostgreSQL, MySQL, SQLite, Redis, or MongoDB setups | Many applications cannot be meaningfully tested without stateful services |
| Background-service setup | 8 | Configure long-running processes such as Gunicorn, Celery, NGINX, file watchers, autossh, or producer-consumer pipelines | Modern apps are rarely a single foreground process politely waiting to be tested |

The most important design choice is the bare sandbox. Prior code-editing benchmarks often evaluate whether the agent can change code after the environment has already been made runnable. SetupBench asks whether the agent can create that runnable state in the first place.

This is a different competence. It combines reading, planning, shell execution, package management, service orchestration, and verification. It also exposes a behaviour that conventional code benchmarks can miss: an agent may be excellent at editing a function once tests run, but poor at discovering what must be installed before tests can run.

That distinction should make procurement teams slightly more annoying, which is usually how reliability improves.

The best evaluated agent still fails more than one-third of the time

The paper evaluates OpenHands variants using several underlying models. The overall success rates are the first reality check.

| OpenHands variant | Overall success rate | Avg tokens | Avg steps |
|---|---|---|---|
| GPT-4o | 34.4% | 303K | 23.2 |
| GPT-4.1 | 50.5% | 436K | 29.5 |
| Claude 3.5 Sonnet | 53.8% | 455K | 21.6 |
| Claude 3.7 Sonnet | 57.0% | 869K | 35.7 |
| Claude 4 Sonnet | 62.4% | 1,129K | 47.1 |

The best result, 62.4%, is meaningful. These agents are not helpless. They can often install packages, launch services, and resolve dependencies. But from an operator’s perspective, a roughly 38% failure rate on setup is not a background detail. It is the difference between “delegate this workflow” and “supervise this workflow while pretending supervision is not work.”

The category breakdown is more revealing than the aggregate.

| Task family | Lowest success | Highest success | Interpretation |
|---|---|---|---|
| Background-service setup | 50.0% | 87.5% | Small task count, but agents can often coordinate visible service behaviours |
| Local database setup | 20.0% | 53.3% | Stateful configuration remains difficult |
| Repository setup | 38.9% | 57.4% | The most common first-mile task is still unreliable |
| Dependency resolution | 25.0% | 87.5% | Strong models do well, but weaker variants collapse sharply |

Two patterns stand out.

First, local database setup is the weak spot. That is not surprising. Database tasks force the agent to coordinate installation, configuration, permissions, ports, migrations, and validation. Stateless code editing does not prepare an agent for that mess. Neither do leaderboards where the mess has been swept under the container.

Second, more capable models are not automatically more economical. Claude 4 reaches the highest overall success rate, but uses substantially more tokens and more steps than Claude 3.5 or GPT-4.1. The paper notes that Claude 4 consumes 104.9 million tokens and 4,377 tool steps across the evaluation. On the per-task averages reported for each variant (1,129K versus 455K tokens, 47.1 versus 21.6 steps), that is roughly two and a half times the tokens and more than double the steps of Claude 3.5 Sonnet.

That is not a reason to dismiss higher-capacity models. It is a reason to stop treating “best model everywhere” as an architecture. Setup tasks may need routing: cheap model for routine inspection, stronger model for ambiguous dependency conflicts, and explicit escalation when the agent detects uncertainty instead of wandering through the filesystem like a raccoon with terminal access.
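The routing idea can be made concrete with a small decision function. This is a sketch under stated assumptions: the tier names (`cheap`, `strong`, `human`), the signal fields, the set of "hard" task families, and the thresholds are all hypothetical and would need tuning against real runs. Nothing here comes from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupSignal:
    task_family: str               # hypothetical labels, e.g. "repo_setup"
    failed_attempts: int           # attempts that already failed validation
    agent_flagged_uncertain: bool  # the agent reported it lacks grounding


# Assumption: families treated as hard enough to skip the cheap tier.
HARD_FAMILIES = {"dependency_resolution", "database_setup"}


def choose_tier(signal: SetupSignal) -> str:
    """Pick which tier handles the next attempt at a setup task."""
    # Two failed attempts: stop spending tokens and escalate to a person.
    if signal.failed_attempts >= 2:
        return "human"
    # Ambiguity, a hard family, or one failure justifies the stronger model.
    if (
        signal.task_family in HARD_FAMILIES
        or signal.agent_flagged_uncertain
        or signal.failed_attempts >= 1
    ):
        return "strong"
    # Routine inspection and standard installs stay on the cheaper model.
    return "cheap"
```

The useful property is that escalation is an explicit, auditable rule rather than something the agent improvises mid-task.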

The failures are boring, which is why they are dangerous

The paper’s failure analysis matters more than the leaderboard. It identifies three recurring failure modes: ignored test tooling, hallucinated constraints, and non-persistent environment setup.

These are not exotic weaknesses. They are precisely the kind of weaknesses that create friction in real engineering teams.

Agents install the app but forget the development environment

One failure mode is incomplete tooling installation. Agents often install runtime dependencies but miss development or test tools. In the paper’s example, the agent installs Python and project dependencies, but fails to account for tox.ini; the validation command then fails because tox is missing.

This is a small failure with a large implication. Human developers often infer setup requirements from conventional files: tox.ini, package.json, Makefile, pyproject.toml, Gemfile, docker-compose.yml, CONTRIBUTING.md. A runnable development environment is not merely the application runtime. It is also the test harness, build tooling, linting stack, migration tooling, and service assumptions that make downstream work verifiable.
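The inference a human makes from conventional files can be approximated with a simple pre-flight check. The sketch below is illustrative only: the mapping from marker file to expected command is an assumption chosen for the example, real projects vary, and the helper name `missing_tools` is hypothetical.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict, Optional

# Illustrative mapping only: marker file -> command expected on PATH.
MARKER_TOOLS: Dict[str, str] = {
    "tox.ini": "tox",
    "package.json": "npm",
    "Makefile": "make",
    "Gemfile": "bundle",
    "docker-compose.yml": "docker",
}


def missing_tools(
    repo_root: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, str]:
    """Return {marker_file: tool} for markers present whose tool is not on PATH."""
    root = Path(repo_root)
    missing: Dict[str, str] = {}
    for marker, tool in MARKER_TOOLS.items():
        # A marker file implies the tool should exist before setup is "done".
        if (root / marker).is_file() and which(tool) is None:
            missing[marker] = tool
    return missing
```

Run before declaring setup complete, a non-empty result is a cheap signal that the runtime may be installed while the test or build tooling is not.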

The paper attributes roughly 17%–26% of failed repo-setup runs to ignored test tooling, depending on the model variant. That is not a rare edge case. It is a recurring sign that agents under-read the operational meaning of repository structure.

For enterprise use, this is the “green demo, red CI” problem. The agent may believe setup is done because the app imports or starts. The business only discovers the miss when tests, validation, or handoff fails later. Very efficient, if the objective is to redistribute annoyance.

Agents invent constraints and then obediently satisfy them

The second failure mode is hallucinated task constraints. The agent infers a non-existent requirement, such as a port number, and modifies configuration accordingly. The result is not mere inaction; it is harmful action.

This is especially important for agentic systems because tools convert hallucination into side effects. A chatbot that invents a port number is irritating. An agent that edits the server configuration to satisfy the invented port number is a production-risk rehearsal.

The obvious fix is not “better vibes in the prompt.” The paper points toward constraint validation mechanisms: require the agent to cite or identify the documentation basis for configuration-changing decisions. In business terms, configuration edits should carry provenance. If an agent changes a port, dependency version, environment variable, or service command, it should be able to answer: “Where did that requirement come from?”

That turns grounding from a language-model aspiration into an engineering control.

Agents make changes that vanish at handoff

The third failure mode is non-persistent setup. The agent installs a tool or modifies environment variables in a way that works in its current shell, but fails when the evaluation harness opens a fresh terminal. In the paper’s example, pnpm was installed but unavailable to the validation command.

This is a handoff failure. And handoff is exactly where many enterprise agent workflows are heading: asynchronous agents working in cloud sandboxes, then returning a repository, logs, or a development environment to humans.

A setup step that only exists in the agent’s shell is not setup. It is a temporary hallucination with filesystem side effects.

The paper’s design implication is sensible: agents need explicit persistence protocols. Environment modifications should be written to persistent configuration files where appropriate, sourced in the current session, and summarised for the human operator. The phrase “agent-human collaboration” can sound grand. Here it means something wonderfully mundane: when the human opens a new shell, the tool should still exist.
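A persistence protocol of this kind can be sketched as two steps: write the change to a file a new shell will read, then verify it from a new shell. This is a minimal sketch, not the paper's method. Which profile file a given harness or login shell actually reads is an assumption the operator must confirm, and both function names are hypothetical.

```python
import shlex
import subprocess
from pathlib import Path


def persist_path_entry(bin_dir: str, profile: Path) -> bool:
    """Append a PATH export to a shell profile if it is not already there.

    Returns True if the file was changed, False if the line already existed.
    """
    line = f'export PATH="{bin_dir}:$PATH"'
    existing = profile.read_text() if profile.exists() else ""
    if line in existing.splitlines():
        return False
    with profile.open("a") as fh:
        if existing and not existing.endswith("\n"):
            fh.write("\n")
        fh.write(line + "\n")
    return True


def visible_in_fresh_shell(tool: str, profile: Path) -> bool:
    """Start a new shell, source the profile, and check that the tool resolves."""
    script = f". {shlex.quote(str(profile))} && command -v {shlex.quote(tool)}"
    result = subprocess.run(["sh", "-c", script], capture_output=True, text=True)
    return result.returncode == 0
```

The second function matters as much as the first: the check is made from a process that shares nothing with the agent's own session, which is the condition the validation harness and the next human will actually face.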

Wasted actions are a cost model hiding inside a benchmark

SetupBench also examines efficiency. This part is easy to overlook because success rate is louder. It should not be overlooked.

The authors compare agent trajectories against human-derived minimal action counts for a 10-instance subset. They exclude actions without meaningful human equivalents, such as internal reasoning calls, completion signals, initial prompt/context steps, and polling commands. The remaining comparison isolates repository exploration behaviour.

The result: 38%–69% of analysed setup actions are wasted.

| Model | Total steps | Optimal steps | Wasted steps | Wasted share |
|---|---|---|---|---|
| Claude 3.5 Sonnet | 186 | 124 | 71 | 38.17% |
| Claude 3.7 Sonnet | 332 | 124 | 208 | 62.65% |
| Claude 4 Sonnet | 397 | 124 | 273 | 68.77% |
| GPT-4o | 193 | 124 | 76 | 39.38% |
| GPT-4.1 | 238 | 124 | 114 | 47.90% |

The takeaway here is not that every exact percentage will generalise. The analysis is based on a 10-instance subset, so it is best read as behavioural diagnosis rather than a universal efficiency law.
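For readers who want to check the arithmetic, the wasted share in the table above is wasted steps divided by total steps. The short sketch below recomputes it from the reported step counts; the dictionary layout and function name are just for illustration.

```python
# (total steps, wasted steps) as reported in the table above.
STEP_COUNTS = {
    "Claude 3.5 Sonnet": (186, 71),
    "Claude 3.7 Sonnet": (332, 208),
    "Claude 4 Sonnet": (397, 273),
    "GPT-4o": (193, 76),
    "GPT-4.1": (238, 114),
}


def wasted_share(total_steps: int, wasted_steps: int) -> float:
    """Percentage of analysed steps counted as wasted, to two decimals."""
    return round(100 * wasted_steps / total_steps, 2)


shares = {model: wasted_share(*counts) for model, counts in STEP_COUNTS.items()}
for model, share in shares.items():
    print(f"{model}: {share:.2f}%")
```

The recomputed shares span the 38%–69% range quoted in the text.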

Still, the behaviours are concrete. Agents repeatedly issue partial reads of the same file, check for installations despite being told the environment is bare, use unnecessary sudo, and inspect files unrelated to setup completion. These actions consume tokens, time, and attention. In long workflows, they also increase the chance that the agent loses the thread before reaching the actual engineering task.

This is where cost accounting becomes architecture. A developer-agent platform that pays for redundant exploration on every task is not just slower. It also compresses the context budget available for the work the user actually wanted.

The deeper issue is representational. Humans build a quick mental map of a repository: where docs live, where build files live, which files are likely to matter. Agents often operate reactively through low-level shell commands. They see fragments. They forget. They reread. They wander.

Better agents may need repository memory as an architectural component: cached directory structures, semantic ranking of setup-relevant files, batched inspection, and explicit state about what has already been read. Not because this sounds sophisticated, but because head -40, then head -60, then head -100 on the same file is not a strategy. It is a cry for tooling.
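One small piece of such repository memory is a read cache, so that growing `head` calls on the same file cost one read instead of three. The class below is a minimal sketch with hypothetical names. It assumes files fit in memory and do not change during the session, which a real agent would have to handle by invalidating entries after edits.

```python
from pathlib import Path
from typing import Dict, List


class FileReadCache:
    """Remembers file contents so repeated partial reads do not hit disk again."""

    def __init__(self) -> None:
        self._lines: Dict[Path, List[str]] = {}
        self.disk_reads = 0  # how many times a file was actually read

    def head(self, path: str, n: int) -> List[str]:
        """Return the first n lines, reading the file only on first access."""
        key = Path(path).resolve()
        if key not in self._lines:
            self._lines[key] = key.read_text().splitlines()
            self.disk_reads += 1
        return self._lines[key][:n]

    def seen(self, path: str) -> bool:
        """Explicit state about what has already been read."""
        return Path(path).resolve() in self._lines
```

With this in place, asking for 40, then 60, then 100 lines of the same README triggers a single read, and `seen` gives the agent a way to answer "have I already looked at this?" without rereading.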

The appendix is implementation scaffolding, not a second thesis

The appendices are useful because they clarify how the benchmark was built and how it can be extended. They include an example dataset entry, prompt templates for deriving setup instructions and synthesising success commands, and scripts or assets for mining dependency-resolution tasks.

Their likely purpose is implementation detail and reproducibility support, not an independent experimental claim. The main evidence remains the SetupBench construction, OpenHands evaluation, failure analysis, and efficiency analysis.

That distinction matters. A benchmark paper often lives or dies by whether others can inspect, reproduce, and extend its task generation process. Here, the appendices show the benchmark’s practical machinery: what fields an instance contains, how success criteria are produced, how dependency conflicts are mined, and how validation is framed.

For business readers, the appendix lesson is not “copy these prompts.” It is that serious agent evaluation needs deterministic validation and task-level structure. A procurement test that asks an agent to “set this up” and then lets a manager subjectively judge the transcript is theatre. SetupBench’s validation command is a more useful pattern: define success externally, run it after the agent finishes, and check whether the environment state actually works.

What Cognaptus infers for businesses using coding agents

The paper directly shows that current OpenHands-based agents struggle with bare-environment setup across a curated benchmark. Cognaptus’ business inference is broader but bounded: organisations should treat setup competence as a separate acceptance criterion for AI developer agents.

That changes evaluation in four practical ways.

1. Add first-mile setup to agent procurement

Many tool evaluations still centre on code generation, issue resolution, or chat quality. SetupBench suggests that enterprise pilots should include tasks such as:

  • clone and configure a real internal repository from a minimal container;
  • install all runtime and test dependencies;
  • initialise local services and databases;
  • run the project’s actual validation command;
  • document persistent environment changes;
  • hand the environment back to a human in a fresh shell.

This is not an academic purity test. It is closer to the first day of onboarding.

2. Require evidence for configuration edits

Hallucinated constraints are dangerous because they produce confident side effects. Agent workflows should require configuration-changing actions to be linked to evidence: a README line, a package manifest, an error message, a service log, or a user instruction.

A practical policy might be simple: no unexplained edits to ports, environment variables, dependency versions, service commands, or database configuration. The agent should attach a reason and source for each such change. Tedious? Yes. Also known as engineering.
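Such a policy is easy to enforce mechanically. The sketch below assumes the agent emits a structured record for each configuration edit; the record fields and the function name `unexplained` are hypothetical, and the example values in the comments are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ConfigEdit:
    target: str                   # what was changed, e.g. "server port"
    new_value: str                # the value the agent set
    source: Optional[str] = None  # where the requirement came from, e.g. "README.md"
    quote: Optional[str] = None   # the line or message that states it


def unexplained(edits: List[ConfigEdit]) -> List[ConfigEdit]:
    """Return edits that lack a non-empty source and supporting quote."""
    def grounded(edit: ConfigEdit) -> bool:
        return bool(edit.source and edit.source.strip()
                    and edit.quote and edit.quote.strip())

    return [edit for edit in edits if not grounded(edit)]
```

A gate that blocks or flags any run where `unexplained(...)` is non-empty will not catch a fabricated citation, but it does force the question "where did that requirement come from?" to be answered in writing.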

3. Treat persistence as part of the deliverable

Agent output should not be limited to code diffs. For setup tasks, the deliverable should include:

| Deliverable | Why it matters |
|---|---|
| Persistent environment changes | Prevents tools from disappearing in fresh shells |
| Commands executed | Enables audit and replay |
| Services started and how | Clarifies whether the environment depends on background processes |
| Validation command and result | Separates “I think it works” from “it passed” |
| Known residual issues | Helps humans resume without archaeology |

The agent should not merely leave a terminal transcript. It should leave a durable setup contract.
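One way to make that contract durable is to serialise it next to the repository. The structure below mirrors the deliverables listed above; the class name, field names, and JSON layout are assumptions for illustration rather than an established format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SetupContract:
    persistent_changes: List[str] = field(default_factory=list)
    commands_executed: List[str] = field(default_factory=list)
    services_started: List[str] = field(default_factory=list)
    validation_command: str = ""
    validation_passed: bool = False
    residual_issues: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialise the contract so a human or CI job can read it later."""
        return json.dumps(asdict(self), indent=2)

    def ready_for_handoff(self) -> bool:
        """Handoff requires a named validation command that actually passed."""
        return bool(self.validation_command) and self.validation_passed
```

Keeping `validation_passed` separate from the transcript is deliberate: it records the result of the external check, not the agent's belief about it.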

4. Route models by setup complexity

The paper shows performance-efficiency trade-offs. Stronger models can achieve higher success, especially on harder categories, but often consume more tokens and steps. That points toward dynamic routing.

Routine setup detection, file-tree inspection, and standard install commands may not require the most expensive model. Ambiguous dependency conflicts, database failures, or multi-service orchestration may justify escalation. The economic unit is not “one agent run.” It is “successful verified setup per dollar and minute.”
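A rough version of that unit can be derived from the figures already reported: average tokens per task divided by success rate gives tokens spent per successful setup. This is a derived metric, not one the paper reports. It ignores per-token price differences between models and wall-clock time, both of which a real cost model would need.

```python
# (average tokens per task in thousands, overall success rate in percent),
# taken from the OpenHands results reported earlier in this article.
RESULTS = {
    "GPT-4o": (303, 34.4),
    "GPT-4.1": (436, 50.5),
    "Claude 3.5 Sonnet": (455, 53.8),
    "Claude 3.7 Sonnet": (869, 57.0),
    "Claude 4 Sonnet": (1129, 62.4),
}


def tokens_per_success(avg_tokens_k: float, success_pct: float) -> float:
    """Thousands of tokens spent per successful setup.

    Total tokens / successes = (avg * N) / (rate * N) = avg / rate.
    """
    return avg_tokens_k / (success_pct / 100)


per_success = {m: tokens_per_success(*v) for m, v in RESULTS.items()}
for model, cost in sorted(per_success.items(), key=lambda kv: kv[1]):
    print(f"{model}: {cost:.0f}K tokens per successful setup")
```

On these figures, Claude 3.5 Sonnet comes out lowest and Claude 4 Sonnet highest in tokens per successful setup, which is the trade-off routing is meant to exploit: pay for the stronger model only where the cheaper one is likely to fail.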

That is less catchy than “autonomous software engineer.” It is also closer to how operations budgets work.

What remains uncertain

SetupBench is valuable because it isolates a neglected capability. It is not a complete map of agentic software engineering.

The benchmark has 93 tasks, all manually reviewed and verified. That supports clarity and deterministic evaluation, but limits scale. Its results should not be treated as the final distribution of all setup work.

The evaluation uses OpenHands variants. That means the measured outcomes reflect both the underlying models and the agent framework. A different framework with stronger repository indexing, state persistence, or validation planning could perform differently.

The sandbox is permissive: root privileges and outbound networking are available. Real enterprise environments often restrict permissions, credentials, network access, package sources, and service ports. Those constraints could make setup harder, not easier.

The domain coverage is broad but incomplete. The paper covers seven language ecosystems, five databases, and several service orchestration patterns, but does not yet include GPU drivers, message queues beyond Redis, Docker Compose, Kubernetes, or infrastructure-as-code tools. Those omissions matter for companies evaluating agents in data engineering, MLOps, platform engineering, or cloud migration work.

So the right conclusion is not “agents cannot do setup.” The right conclusion is sharper: setup is a distinct capability, current agents are uneven at it, and businesses should stop assuming it comes free with code generation.

The real benchmark is whether the human can continue

The most useful line of interpretation is this: SetupBench evaluates not only whether an agent can prepare an environment, but whether its work survives contact with the next actor in the workflow.

That next actor may be a validation harness. It may be a developer opening a fresh shell. It may be CI. It may be another agent. In all cases, the setup must persist, the assumptions must be grounded, and the commands must lead to a verifiable state.

This is the unromantic layer of software engineering automation. It has fewer dramatic demos than code synthesis. It also determines whether the code synthesis can be used.

The first hurdle is not writing the patch. It is getting to the point where a patch can be tested. SetupBench makes that hurdle visible. Now agent builders, buyers, and operators have less excuse to trip over it theatrically.

Cognaptus: Automate the Present, Incubate the Future.


  1. Avi Arora, Jinu Jang, and Roshanak Zilouchian Moghaddam, “SetupBench: Assessing Software Engineering Agents’ Ability to Bootstrap Development Environments,” arXiv:2507.09063, 2025. The arXiv HTML page was unavailable during drafting, so this article uses the PDF version.