Is it possible to train a language model to become a capable scientist?

That provocative question lies at the heart of a new milestone in AI research. In SciMaster: Towards General-Purpose Scientific AI Agents, a team from Shanghai Jiao Tong University introduces X-Master, a tool-augmented open-source agent that has just achieved the highest score ever recorded on Humanity’s Last Exam (HLE)—surpassing even OpenAI and Google.

But what makes this feat more than just a leaderboard update is how X-Master got there. Instead of training a larger model or fine-tuning on more data, the researchers innovated on agentic architecture and inference-time workflows. The result? An extensible framework that emulates the exploratory behavior of human scientists, not just their answers.

X-Master: Reasoning with Tools, Like a Scientist

At its core, X-Master is built on a deceptively simple idea: that scientific reasoning often requires stepping outside the model’s own head.

Most LLMs hallucinate when they hit a knowledge boundary. X-Master, by contrast, recognizes when it’s stuck—and writes Python code to search, analyze, or retrieve the information it needs. This is possible because the agent is designed around “code as interaction language”: when it needs to access the web, parse data, or calculate something complex, it generates executable code blocks mid-thought.

<code>
# Emitted mid-reasoning by the model itself; the environment executes the
# block and feeds the result back into context before reasoning resumes.
from toolkit import web_search

result = web_search("Who is the best-performing agent on MLE-Bench?")
</code>

This form of tool use isn’t hardcoded; it’s emergent. By injecting a self-guidance statement at the start of each query (e.g., “I will use tools when needed”), the base model (DeepSeek-R1-0528) begins to act more like an agent. No additional fine-tuning is required.
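In practice, this implies a simple dispatch loop around the model: generate until a code block appears, execute it, splice the output back into the context, and continue. Here is a minimal sketch of such a loop; the regex delimiter, the prompt wording, and the in-process run_code helper are illustrative assumptions rather than the paper’s released implementation.

<code>
import contextlib, io, re

CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)
SELF_GUIDANCE = "I will use tools when needed.\n\n"  # injected self-guidance

def run_code(snippet: str) -> str:
    """Execute an emitted code block and capture what it prints.
    A real deployment would isolate this in a sandboxed process."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(snippet, {})
    return buf.getvalue()

def agent_loop(generate, question: str, max_steps: int = 8) -> str:
    """`generate` is any callable mapping a prompt string to model text."""
    context = SELF_GUIDANCE + question
    answer = ""
    for _ in range(max_steps):
        answer = generate(context)
        block = CODE_RE.search(answer)
        if block is None:  # no tool call: treat this as the final answer
            break
        # Execute the emitted code and append its output to the context,
        # so the next round of reasoning can build on what was retrieved.
        context += answer[:block.end()]
        context += "\n[tool output]\n" + run_code(block.group(1))
    return answer
</code>

Wired to any completion endpoint via generate, the same loop turns a frozen reasoning model into a tool user.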

Scattered and Stacked: A Workflow That Thinks Like a Lab

Tool use alone isn’t enough. Scientific thinking also demands iteration, criticism, synthesis, and selection.

To scale up its reasoning capacity, the team designed an agentic workflow called X-Masters, in which the base agent is cloned into four distinct roles:

  • Solver: generates five diverse answers in parallel
  • Critic: reviews and corrects each answer individually
  • Rewriter: synthesizes the five corrected solutions into new ones
  • Selector: picks the best final answer based on consistency

This “scattered-and-stacked” pattern mirrors how a research lab works: brainstorm broadly, critique internally, integrate promising ideas, then converge.

The analogy to reinforcement learning rollouts is apt. Scattering = exploration; stacking = exploitation. Yet here, the entire process happens at inference time. That means it can be applied to any open-source model without retraining.
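To make the pattern concrete, here is a compact sketch of how the four roles might be chained, assuming each role is simply the base agent invoked with a different prompt (the prompt wording and the agent callable are invented for illustration; the released workflow may differ):

<code>
def x_masters(agent, question: str, n: int = 5) -> str:
    """Scattered-and-stacked workflow; `agent` maps a prompt to model text."""
    # Scatter: n Solvers explore the question independently (run in
    # parallel in a real deployment; sequential here for simplicity).
    drafts = [agent(f"Solve:\n{question}") for _ in range(n)]
    # Critics review and correct each draft individually.
    corrected = [agent(f"Critique and correct this solution to "
                       f"'{question}':\n{d}") for d in drafts]
    # Stack: Rewriters synthesize the corrected set into fresh candidates.
    pool = "\n---\n".join(corrected)
    rewrites = [agent(f"Synthesize an improved solution from:\n{pool}")
                for _ in range(n)]
    # A Selector converges on the most consistent final answer.
    return agent(f"Question: {question}\nCandidates:\n" +
                 "\n---\n".join(rewrites) +
                 "\nSelect the most consistent final answer.")
</code>

Because every role is just another inference call on the same frozen model, the pipeline adds capability without touching a single weight.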

The Benchmark: Humanity’s Last Exam

Developed by subject experts from over 500 institutions, HLE is the most demanding evaluation of AI scientific reasoning to date, spanning math, physics, biology, engineering, and the social sciences. While leading systems like OpenAI’s Deep Research and Gemini 2.5 achieved scores around 26%, X-Masters set a new record: 32.1%.

More importantly, the performance gains came from system design, not from a larger model or additional training. In an ablation study:

  • Tool-augmented reasoning alone improved accuracy from 17.7% to 21.1%
  • Adding critics and rewriters pushed it to 30.6%
  • Final selection edged it to 32.1%

This demonstrates that structured multi-agent workflows can dramatically enhance reasoning quality even with static LLMs.

Implications: From Benchmarks to Biology

While X-Masters dominated HLE as a generalist, it also outperformed purpose-built agents like Biomni and STELLA on biomedical tasks—despite using only two generic tools (web search and web parse). On the TRQA-lit benchmark, it scored 67.4%, beating systems using over 500 domain-specific tools.

This suggests a path forward for leaner, more general scientific agents:

  • Better workflows can beat bigger models
  • Fewer tools, if well-integrated, can outcompete large toolkits
  • Open-source infrastructure can rival proprietary giants

The researchers promise more in the SciMaster series, including specialized agents for literature mining, simulation, and experiment planning. But even Part I delivers a powerful message: you don’t need to be OpenAI to build world-class AI.

Final Thoughts

The arrival of X-Master marks a turning point: from static answer machines to interactive, tool-wielding reasoners. It models not just scientific knowledge, but the behavior of scientific inquiry. And crucially, it does so in a fully open and replicable way.

As we move into an era of AI-augmented discovery, frameworks like X-Masters may prove just as important as the models themselves.

Cognaptus: Automate the Present, Incubate the Future.