Benchmarks Are From Mars, Workflows Are From Venus: Why AI Research Co‑Pilots Keep Failing in the Wild
Lab meeting. The principal investigator cuts the validation budget from $15,000 to $5,000. The postdoc has already discussed the original plan with an AI research co-pilot, which suggested a 10-marker flow cytometry panel, bulk RNA-seq validation, and immunofluorescence. Now the postdoc returns and says: we need to prioritize. A useful co-pilot should not simply repeat the original protocol with a smaller price tag. It should remember the hypothesis, preserve the scientific goal, understand the new constraint, propose a cheaper validation path, and know which evidence can be deferred without making the proposal look scientifically flimsy. In other words, it must behave less like a brilliant autocomplete box and more like a collaborator with a working memory, a sense of context, and a modest respect for reality. A rare feature, apparently. ...