Wheel Smarts > Wheel Reinvention: What GitTaskBench Really Measures
TL;DR for operators

GitTaskBench is useful because it evaluates code agents where enterprise automation usually breaks: not in a clean coding puzzle, but inside an existing repository with dependencies, pretrained weights, fragile instructions, file formats, runtime constraints, and a user asking for a finished output.1

The paper’s headline is not “agents can code”. We have enough confetti for that parade. The sharper finding is that agents are still inconsistent at the whole delivery chain. The best reported combination, OpenHands with Claude 3.7, reaches 72.22% execution completion but only 48.15% task pass rate. In other words, many runs produce something executable, but far fewer produce something good enough. ...
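To make the gap between "executable" and "good enough" concrete, here is a minimal sketch of the arithmetic behind the two reported figures. The function name `delivery_gap` and its dictionary keys are hypothetical, and the calculation assumes that every run counted as a task pass is also counted as an execution completion. Treat that as an assumption about how the two metrics nest, not as the paper's stated definition.

```python
def delivery_gap(execution_completion_pct: float, task_pass_pct: float) -> dict:
    """Compare an execution completion rate with a task pass rate.

    Assumption (not taken from the paper): passing runs are a subset of
    runs that completed execution, so the ratio of the two rates can be
    read as "share of executable runs that were also good enough".
    """
    if not 0 < execution_completion_pct <= 100:
        raise ValueError("execution completion must be in (0, 100]")
    if not 0 <= task_pass_pct <= execution_completion_pct:
        raise ValueError("task pass rate cannot exceed execution completion")

    return {
        # Percentage points lost between "it ran" and "it passed".
        "gap_points": round(execution_completion_pct - task_pass_pct, 2),
        # Fraction of executable runs that also passed the quality bar.
        "pass_given_executed": task_pass_pct / execution_completion_pct,
    }


# Figures reported for OpenHands with Claude 3.7.
summary = delivery_gap(72.22, 48.15)
print(f"gap: {summary['gap_points']} points")
print(f"pass share of executable runs: {summary['pass_given_executed']:.1%}")
```

Under that assumption, roughly two thirds of the runs that executed also passed, which leaves about one in three executable runs falling short of the quality bar.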