Scientific Discovery

The Benchmark Awakens: AstaBench and the New Standard for Agentic Science

The Benchmark Awakens: AstaBench and the New Standard for Agentic Science The latest release from the Allen Institute for AI, AstaBench, represents a turning point for how the AI research community evaluates large language model (LLM) agents. For years, benchmarks like MMLU or ARC have tested narrow reasoning and recall. But AstaBench brings something new—it treats the agent not as a static model, but as a scientific collaborator with memory, cost, and strategy. ...

The Missing Link: How AI Maps Hidden Properties in Materials Science

The search for new superconductors, energy materials, and exotic compounds often begins not in a lab—but in a database. Yet despite decades of digitization, scientific knowledge remains fragmented across millions of papers, scattered ontologies, and uncharted connections. A new study from Los Alamos National Laboratory proposes an AI-driven framework that doesn’t just analyze documents—it predicts the next breakthrough. From Papers to Properties: A Three-Tiered Approach At the heart of this method is a clever ensemble pipeline that combines interpretability with predictive power. The authors start by mapping over 46,000 papers on transition-metal dichalcogenides (TMDs)—a key class of 2D materials—into a matrix of latent topics and material mentions. Then they apply a hierarchical modeling approach: ...

Passing Humanity's Last Exam: X-Master and the Emergence of Scientific AI Agents

Is it possible to train a language model to become a capable scientist? That provocative question lies at the heart of a new milestone in AI research. In SciMaster: Towards General-Purpose Scientific AI Agents, a team from Shanghai Jiao Tong University introduces X-Master, a tool-augmented open-source agent that has just achieved the highest score ever recorded on Humanity’s Last Exam (HLE)—surpassing even OpenAI and Google. But what makes this feat more than just a leaderboard update is how X-Master got there. Instead of training a larger model or fine-tuning on more data, the researchers innovated on agentic architecture and inference-time workflows. The result? An extensible framework that emulates the exploratory behavior of human scientists, not just their answers. ...