Reproducibility

Stable World Models, Unstable Benchmarks: Why Infrastructure Is the Real Bottleneck

Opening: Why this matters now. World models are having a quiet renaissance. Once framed as a curiosity for imagination-driven agents, they are now central to planning, robotics, and representation learning. Yet for all the architectural creativity, progress in the field has been oddly brittle. Results are impressive on paper, fragile in practice, and frustratingly hard to reproduce. ...

The Benchmark Awakens: AstaBench and the New Standard for Agentic Science

The latest release from the Allen Institute for AI, AstaBench, represents a turning point for how the AI research community evaluates large language model (LLM) agents. For years, benchmarks like MMLU or ARC have tested narrow reasoning and recall. But AstaBench brings something new: it treats the agent not as a static model, but as a scientific collaborator with memory, cost, and strategy. ...

Automate All the Things? Mind the Blind Spots

Automation is a superpower—but it’s also a blindfold. New AI “scientist” stacks promise to go from prompt → idea → code → experiments → manuscript with minimal human touch. Today’s paper shows why that convenience can quietly erode scientific integrity—and, by extension, the credibility of any product decisions built on top of it. The punchline: the more you automate, the less you see—unless you design for visibility from day one. ...

From PDF to PI: Turning Papers into Productive Agents

We’ve all met the paper that promises the moon—then hands you a README, a maze of conda environments, and a prayer. Paper2Agent proposes a different contract: don’t read me, run me. By converting a research paper (and its repo) into a Model Context Protocol (MCP) server that any LLM agent can call, it turns methods into tools, figures into reproducible tests, and “future work” into executable prompts. This isn’t another “Papers with Code” link farm. It’s a pipeline that (1) mines the repo/tutorials, (2) builds a pinned environment, (3) extracts single‑purpose tools with clear I/O, (4) tests them until they match the paper’s outputs, and (5) deploys the lot as a remote MCP server. Hook that server to your favorite coding/chat agent and you get a paper‑specific copilot that can reproduce, explain, and extend the work. ...