Wheel Smarts > Wheel Reinvention: What GitTaskBench Really Measures

Agents don’t build Rome from scratch; they retrofit the city. GitTaskBench (arXiv:2508.18993) is the first benchmark that grades code agents on how well they exploit existing GitHub repositories to deliver real-world outcomes, not just pass algorithm puzzles. It also puts a price tag on success via an Alpha value that blends accuracy with cost, bringing long-missing business realism to agent evals.

TL;DR

What’s new: 54 tasks across 7 modalities (image, video, speech, office docs, web scraping, security/privacy, biosignals), each paired to a real repo and a practical, automated test harness.
Why it matters: The hard part isn’t just writing code; it’s environment setup, dependency wrangling, repo comprehension, and workflow orchestration.
Headline result: Even the best stack, OpenHands + Claude 3.7, passes only ~48% of tasks; environment/setup issues cause ~65% of all failures.
Business twist: The Alpha value estimates net economic benefit per task by combining success, quality, and token costs. Expensive tasks become clear wins; cheap tasks require ruthless cost control.

The Benchmark, de-jargoned

Problem framed: In real shops, devs search, fork, and adapt. GitTaskBench simulates that reality. Each task gives an agent a specific repo (e.g., DeOldify, Scrapy, NeuroKit, SpeechBrain) and a concrete user goal (e.g., “colorize this photo” or “extract author/quote pairs into CSV”). Success is determined by a task-specific metric (e.g., NIQE for image quality; SNR/SDR for speech separation; field-level F1 for scraping; column/row fidelity for office docs) and an execution check (the thing actually runs and outputs in the right format). ...
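To make the "business twist" concrete, here is a minimal sketch of how an Alpha-style score could combine success, quality, and token costs. The exact combination is an assumption for illustration: success (0 or 1) times a hypothetical market value for the task, scaled by a quality factor in [0, 1], minus the agent's cost, averaged over tasks. The names `TaskResult`, `market_value`, `task_alpha`, and `mean_alpha` are illustrative and do not come from the benchmark's code.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TaskResult:
    """Outcome of one agent run on one task (all fields are illustrative)."""
    succeeded: bool      # did the run pass the execution and metric checks?
    quality: float       # quality factor in [0, 1] for the delivered output
    market_value: float  # assumed price of having the task done by hand (USD)
    cost: float          # token/API cost of the agent run (USD)


def task_alpha(result: TaskResult) -> float:
    """Net benefit of one run: value delivered minus what the run cost.

    Assumed form: success * market_value * quality - cost.
    A failed run still pays its cost, so its alpha is negative.
    """
    success = 1.0 if result.succeeded else 0.0
    return success * result.market_value * result.quality - result.cost


def mean_alpha(results: Iterable[TaskResult]) -> float:
    """Average net benefit per task over a set of runs."""
    results = list(results)
    if not results:
        raise ValueError("need at least one task result")
    return sum(task_alpha(r) for r in results) / len(results)


# Hypothetical numbers, chosen only to show the asymmetry in the TL;DR.
expensive_win = TaskResult(succeeded=True, quality=0.5, market_value=100.0, cost=2.0)
cheap_win = TaskResult(succeeded=True, quality=0.5, market_value=2.0, cost=2.5)
failure = TaskResult(succeeded=False, quality=0.0, market_value=100.0, cost=1.0)

for name, run in [("expensive", expensive_win), ("cheap", cheap_win), ("failed", failure)]:
    print(f"{name}: alpha = {task_alpha(run):.2f}")
print(f"mean alpha = {mean_alpha([expensive_win, cheap_win, failure]):.2f}")
```

Under these made-up numbers, the high-value task stays clearly positive even at middling quality, while the low-value task goes negative despite succeeding, because the run cost more than the output was worth. That is the sense in which cheap tasks demand cost control.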

August 27, 2025 · 5 min · Zelina

Beyond Stack Overflow: CodeAssistBench Exposes the Real Gaps in LLM Coding Help

The Trouble With Stack Overflow-Style Benchmarks

Large language models (LLMs) have been hailed as revolutionizing programming workflows. But most coding benchmarks still test them like they’re junior devs solving textbook exercises. Benchmarks such as HumanEval, MBPP, and even InfiBench focus on code synthesis in single-turn scenarios. These tests make models look deceptively good: ChatGPT-4 gets 83% on StackEval. Yet in real development, engineers don’t just ask isolated questions. They explore, revise, troubleshoot, and clarify, all while navigating large, messy codebases. ...

July 16, 2025 · 4 min · Zelina