
Wheel Smarts > Wheel Reinvention: What GitTaskBench Really Measures
Agents don’t build Rome from scratch—they retrofit the city. GitTaskBench (arXiv:2508.18993) is the first benchmark that grades code agents on how well they exploit existing GitHub repositories to deliver real-world outcomes, not just pass algorithm puzzles. It also puts a price tag on success via an Alpha value that blends accuracy with cost, bringing long-missing business realism to agent evals. TL;DR What’s new: 54 tasks across 7 modalities (image, video, speech, office docs, web scraping, security/privacy, biosignals), each paired to a real repo and a practical, automated test harness. Why it matters: The hard part isn’t just writing code—it’s environment setup, dependency wrangling, repo comprehension, and workflow orchestration. Headline result: Even the best stack—OpenHands + Claude 3.7—passes only ~48% of tasks; environment/setup issues cause ~65% of all failures. Business twist: The Alpha value estimates net economic benefit per task by combining success, quality, and token costs. Expensive tasks become clear wins; cheap tasks require ruthless cost control. The Benchmark, de-jargoned Problem framed: In real shops, devs search, fork, and adapt. GitTaskBench simulates that reality. Each task gives an agent a specific repo (e.g., DeOldify, Scrapy, NeuroKit, SpeechBrain) and a concrete user goal (e.g., “colorize this photo” or “extract author/quote pairs into CSV”). Success is determined by a task-specific metric (e.g., NIQE for image quality; SNR/SDR for speech separation; field-level F1 for scraping; column/row fidelity for office docs) and an execution check (the thing actually runs and outputs in the right format). ...