Agents don’t build Rome from scratch—they retrofit the city. GitTaskBench (arXiv:2508.18993) is the first benchmark that grades code agents on how well they exploit existing GitHub repositories to deliver real-world outcomes, not just pass algorithm puzzles. It also puts a price tag on success via an Alpha value that blends accuracy with cost, bringing long-missing business realism to agent evals.

TL;DR

  • What’s new: 54 tasks across 7 modalities (image, video, speech, office docs, web scraping, security/privacy, biosignals), each paired to a real repo and a practical, automated test harness.
  • Why it matters: The hard part isn’t just writing code—it’s environment setup, dependency wrangling, repo comprehension, and workflow orchestration.
  • Headline result: Even the best stack—OpenHands + Claude 3.7—passes only ~48% of tasks; environment/setup issues cause ~65% of all failures.
  • Business twist: The Alpha value estimates net economic benefit per task by combining success, quality, and token costs. Expensive tasks become clear wins; cheap tasks require ruthless cost control.

The Benchmark, de-jargoned

Problem framed: In real shops, devs search, fork, and adapt. GitTaskBench simulates that reality. Each task gives an agent a specific repo (e.g., DeOldify, Scrapy, NeuroKit, SpeechBrain) and a concrete user goal (e.g., “colorize this photo” or “extract author/quote pairs into CSV”). Success is determined by a task-specific metric (e.g., NIQE for image quality; SNR/SDR for speech separation; field-level F1 for scraping; column/row fidelity for office docs) and an execution check (the thing actually runs and outputs in the right format).
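To make "task-specific metric plus execution check" concrete, here is a minimal sketch of a field-level F1 gate for the scraping example. The column names, file paths, and the 0.9 threshold are hypothetical illustrations, not GitTaskBench's actual harness.

```python
# Illustrative sketch: a field-level F1 check for the author/quote scraping task.
# Column names, paths, and the threshold are assumptions, not GitTaskBench's harness.
import csv

def field_f1(pred_csv: str, truth_csv: str) -> float:
    """F1 over exact (author, quote) pairs extracted vs. ground truth."""
    def load(path):
        with open(path, newline="", encoding="utf-8") as f:
            return {(row["author"].strip(), row["quote"].strip())
                    for row in csv.DictReader(f)}

    pred, truth = load(pred_csv), load(truth_csv)
    if not pred or not truth:
        return 0.0
    tp = len(pred & truth)                       # pairs the agent got exactly right
    precision, recall = tp / len(pred), tp / len(truth)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

# Execution check + metric threshold in one gate (0.9 is a made-up bar):
# passed = field_f1("output.csv", "ground_truth.csv") >= 0.9
```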

What’s being tested (and too often ignored elsewhere):

  1. Repo literacy — finding entry points, understanding configs, tracing module boundaries.
  2. Autonomous setup — installing the right Python wheels, system libs, and model weights; patching version conflicts.
  3. Task focus — writing/adjusting glue code to deliver the requested output, not a demo screenshot.

The Alpha Value: pricing “agent usefulness”

Classic benchmarks stop at pass/fail. GitTaskBench asks, “Was it worth it?”

Alpha per task = (Success × Market Value × Quality) − Cost

Where:

  • Success: 1 if the agent finishes end‑to‑end (execution completes) and meets the task threshold; else 0.
  • Market Value (MV): human contractor price for a comparable deliverable (e.g., $10/photo restore).
  • Quality (Q): rater consensus vs. a human ground truth (levels from 0 to 1).
  • Cost (C): model API spend (or deployment cost).

Why it’s clever: it separates technically impressive but economically dubious wins from boring but profitable ones. If an agent nails a high‑MV task with decent Q at low C, Alpha is strongly positive. Conversely, a low‑MV image filter becomes a money sink the moment token spend balloons.
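The bookkeeping itself is trivial; the value is in forcing the inputs to be written down. A minimal sketch with illustrative numbers (only the $10 photo-restore market value comes from the article; quality scores and costs below are hypothetical):

```python
def alpha(success: int, market_value: float, quality: float, cost: float) -> float:
    """Alpha per task = (Success x Market Value x Quality) - Cost."""
    return success * market_value * quality - cost

# Illustrative numbers only (the $10 photo-restore MV is the article's example):
photo_restore = alpha(success=1, market_value=10.0, quality=0.8, cost=1.20)   #  6.80 -> clear win
cheap_filter  = alpha(success=1, market_value=0.50, quality=1.0, cost=0.90)   # -0.40 -> money sink
failed_run    = alpha(success=0, market_value=10.0, quality=0.0, cost=2.50)   # -2.50 -> pure cost
```

A failed run never earns any MV back, only cost, which is why execution completion and task pass rates both matter in the results below.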

What actually worked (and didn’t)

  • Frameworks: OpenHands consistently topped Aider and SWE‑Agent on success; SWE‑Agent was leaner on tokens (a solid budget choice).
  • Models: Closed-source models still lead. Claude 3.7 + OpenHands reached ~72% execution completion and ~48% task pass rates; GPT‑4.1 trailed slightly but was far more cost‑efficient. High‑quality open models (Qwen/DeepSeek) posted meaningful—but still lagging—results.
  • Modalities: Agents do best on text/office tasks (straightforward library APIs) and struggle on multimodal ones (vision/speech) with heavy weights, fragile dependencies, and bespoke runners.
  • Failure anatomy: Environment/setup (deps, wheels, system libs) dominates. Next: weak workflow planning, repo misunderstanding, and runtime timeouts.

Why this matters for product & engineering leaders

Most teams evaluating “AI devs” optimize for unit tests and coding toys. GitTaskBench forces a workflow view:

  • Throughput reality: If your agent can’t create a clean venv, pin versions, download/check weights, and rerun deterministically, it won’t ship.
  • Stack selection as a portfolio: Pick a primary framework+model for general tasks (e.g., OpenHands+GPT‑4.1 for balanced ROI), then maintain niche specialists (e.g., Claude for complex repo spelunking) and switch by task economics.
  • Guardrails ≠ governance: Most failures weren’t “bad prompts”—they were ops failures. Bake in environment recipes, retry trees, cached wheels, checksum-verified assets, and scriptable fallbacks (pip→uv, torch CUDA variants, prebuilt wheels, CPU fallbacks).
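A minimal sketch of what "scriptable fallbacks" can mean in practice, assuming a requirements.txt and a PyTorch dependency; the commands and the CPU wheel index URL are assumptions to adapt to your stack, not a prescribed recipe:

```python
# Fallback chain sketch: pip -> uv for dependency install, then a CPU-only
# torch reinstall if no usable CUDA device is found. Commands and the CPU
# wheel index URL are assumptions to adapt to your stack.
import subprocess
import sys

def run(*cmd: str) -> bool:
    """Run a command; treat 'tool not installed' and nonzero exits as failure."""
    try:
        return subprocess.run(list(cmd)).returncode == 0
    except FileNotFoundError:          # e.g. uv not installed at all
        return False

def install_requirements(req: str = "requirements.txt") -> bool:
    # Primary: classic pip; fallback: uv's pip-compatible resolver.
    return (run(sys.executable, "-m", "pip", "install", "-r", req)
            or run("uv", "pip", "install", "-r", req))

def ensure_torch_backend() -> None:
    # If a CUDA build landed on a box with no usable GPU, fall back to CPU wheels.
    try:
        import torch
        if torch.cuda.is_available() or torch.version.cuda is None:
            return                     # usable GPU build, or already a CPU build
    except ImportError:
        pass
    run(sys.executable, "-m", "pip", "install", "--force-reinstall", "torch",
        "--index-url", "https://download.pytorch.org/whl/cpu")

if __name__ == "__main__":
    ok = install_requirements()
    ensure_torch_backend()
    sys.exit(0 if ok else 1)
```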

A practical buyer’s guide (fast track)

| Use case | Your priority | Suggested default | When to switch |
| --- | --- | --- | --- |
| Office docs, PDF split/parse | Cost & reliability | SWE‑Agent + GPT‑4.1 | If accuracy stalls on edge PDFs → OpenHands + Claude |
| Web scraping/crawling | Anti‑fragility to site quirks | OpenHands + GPT‑4.1 | Heavier anti‑bot logic or headless flows → Claude |
| Image/speech model pipelines | Setup resilience | OpenHands + Claude | Budget pressure → OpenHands + DeepSeek (monitor Alpha) |
| Biosignal analytics (NeuroKit) | Domain correctness | OpenHands + GPT‑4.1 | If metrics fail repeatedly → Claude, longer timeouts |

Ops playbook: cutting into the ~65% of failures rooted in environment/setup

  1. Deterministic envs: lockfiles (uv pip compile/pip-tools), --only-binary :all: where possible, and wheel cache mirrors.
  2. Hardware awareness: detect CUDA/CPU and select wheels accordingly; pass TORCH_CUDA_ARCH_LIST when needed.
  3. Weights & assets: mirror model files; verify with checksums; set offline fallbacks.
  4. Entry-point discovery: heuristic search for cli.py, inference.py, or __main__ blocks; auto-read README task sections (see the sketch after this list).
  5. Timeout design: longer first-run windows; progressive backoff; resume from last successful step; pre-flight dry‑runs.
  6. Quality gates: wire task‑native metrics (NIQE/SSIM/SNR/F1) into the loop so the agent can self-correct before “final.”
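Item 4 is the most mechanical of these and easy to script. The sketch below leans only on the standard library; the filename patterns and README heuristics are assumptions for illustration, not anything GitTaskBench prescribes.

```python
# Heuristic entry-point discovery for an unfamiliar repo (item 4 above).
# Filename patterns and the README scan are illustrative assumptions.
from pathlib import Path

CANDIDATE_NAMES = {"cli.py", "main.py", "inference.py", "run.py", "demo.py"}

def find_entry_points(repo: str) -> list[Path]:
    """Rank likely entry points: known filenames first, then __main__ guards."""
    named, guarded = [], []
    for path in Path(repo).rglob("*.py"):
        if any(part in {".git", "tests", "test"} for part in path.parts):
            continue                           # skip fixtures and test scripts
        if path.name in CANDIDATE_NAMES:
            named.append(path)
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue
        if '__name__ == "__main__"' in text or "__name__ == '__main__'" in text:
            guarded.append(path)
    return named + guarded

def readme_usage(repo: str) -> str:
    """Pull usage/quickstart headings and command lines from the README, if any."""
    for name in ("README.md", "README.rst", "README.txt"):
        readme = Path(repo) / name
        if readme.exists():
            text = readme.read_text(encoding="utf-8", errors="ignore")
            keep = [line for line in text.splitlines()
                    if line.lower().lstrip("#= ").startswith(("usage", "quickstart", "getting started"))
                    or line.strip().startswith(("python ", "pip ", "$ "))]
            return "\n".join(keep)
    return ""

# Example: print(find_entry_points("./DeOldify")[:3]); print(readme_usage("./DeOldify"))
```

Ranking named entry points ahead of bare __main__ guards keeps the agent from burning tokens on example scripts before it has even read the README.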

What it changes for Cognaptus

For clients considering agentic automation, GitTaskBench-like evaluation should be the RFP: give your actual repos, define outputs and quality gates, and compute Alpha. We’ll design portfolio policies that route tasks to the cheapest stack that still clears quality—then report Alpha monthly so the CFO sees value, not just velocity.

Bottom line: GitTaskBench is the first serious nudge away from “can the agent code?” toward “can the agent deliver?” Once you price delivery, your stack choices—and your roadmap—get a lot simpler.


Cognaptus: Automate the Present, Incubate the Future