Benchmarks Are From Mars, Workflows Are From Venus: Why AI Research Co‑Pilots Keep Failing in the Wild
Opening: Why this matters now
Benchmarks are having a moment. Every few weeks, a new leaderboard appears claiming to measure a model's research capability, from literature recall to CRISPR planning. And yet, inside real laboratories, scientists quietly report a different truth: systems that ace these benchmarks often become surprisingly helpless when asked to collaborate across days, adapt to constraints, or simply remember that the budget has shrunk since yesterday. ...