AIRS-Bench: When AI Starts Doing the Science, Not Just Talking About It
Opening: Why this matters now

For years, AI progress has been narrated through a familiar ritual: introduce a new benchmark, top it with a new model, declare victory, repeat. But as large language models graduate from single-shot answers to multi-step agentic workflows, that ritual is starting to crack. If AI systems are now expected to design experiments, debug failures, iterate on ideas, and judge their own results, then accuracy on static datasets is no longer the right yardstick. ...