AIRS-Bench: When AI Starts Doing the Science, Not Just Talking About It

Opening: Why this matters now

For years, AI progress has been narrated through a familiar ritual: introduce a new benchmark, top it with a new model, declare victory, repeat. But as large language models graduate from single-shot answers to multi-step agentic workflows, that ritual is starting to crack. If AI systems are now expected to design experiments, debug failures, iterate on ideas, and judge their own results, then accuracy on static datasets is no longer the right yardstick. ...

February 9, 2026 · 3 min · Zelina

Sequential Beats Parallel: When Deep Research Agents Learn to Reflect

Opening: Why this matters now

The last year has been crowded with so-called deep research agents. Everyone parallelizes. Everyone fans out queries. Everyone promises doctoral-level synthesis at web speed. And yet, the leaderboard keeps telling an inconvenient story: throwing more parallel agents at a problem does not reliably buy depth. The paper “Deep Researcher with Sequential Plan Reflection and Candidates Crossover” enters this debate with a pointed thesis: research is not a map-reduce problem. If you want insight, you need memory, reflection, and the ability to change your mind mid-flight. ...

January 31, 2026 · 4 min · Zelina