Passing Humanity's Last Exam: X-Master and the Emergence of Scientific AI Agents

TL;DR for operators

Benchmark wins usually arrive wrapped in the usual fog machine: bigger model, more data, more parameters, more destiny. The X-Master paper is more interesting because it is not mainly a bigger-model story.¹ It is a systems story.

The researchers take DeepSeek-R1-0528, a strong open-source reasoning model, and make it behave more like an agent by giving it a disciplined way to call tools during its own reasoning process. The key design choice is simple: use Python code as the interaction language. When the model needs to search, parse a paper, compute a value, or validate a hypothesis, it emits executable code; the system runs it; the result is inserted back into the context; the model continues reasoning.

Then they scale that single-agent pattern into X-Masters: five solver attempts, critic refinement, rewriting, and final selection. The reported result is 32.1% on the text-only subset of Humanity’s Last Exam, above the paper’s cited scores for OpenAI Deep Research at 26.6% and Google DeepMind Deep Research at 26.9%. That is not “AI has solved science.” It is “workflow design can move the needle materially on hard scientific QA.”

For businesses, the practical message is sharper than the leaderboard headline. Before funding custom model training, teams should ask whether they have exhausted cheaper capability levers: better tool interfaces, structured multi-pass reasoning, external verification, candidate generation, critique, synthesis, and selection. Boring plumbing, yes. Productive plumbing, also yes.

The boundary: the evaluation is text-only, benchmark-centric, dependent on web/tool reliability, and uses automated judging. The paper does not prove that X-Master can run a lab, replace domain experts, or produce reliable scientific discovery end-to-end. It shows that open-source agent orchestration can compete surprisingly well with closed research products on a difficult benchmark. That is already enough trouble for anyone selling “just buy the biggest model” as a strategy.

The result is not a new brain; it is a better research loop

The easiest way to misread X-Master is to treat it as a new scientific model. It is not. The paper’s base reasoning model is DeepSeek-R1-0528. The contribution sits around the model: tool use, context manipulation, workflow orchestration, and inference-time computation.

That distinction matters because it changes the business lesson. If the improvement came mainly from training a new frontier model, only a small number of organisations could copy the path. If the improvement comes from orchestration, more teams can experiment with the pattern using existing models, existing tools, and a slightly less romantic budget.

The authors describe X-Master as part of a broader SciMaster programme for general-purpose scientific AI agents. But Part I is less about grand scientific autonomy and more about a practical question: can an open system, using a strong reasoning model plus tool-augmented workflows, lead on a hard scientific benchmark?

Their answer is yes, with caveats attached in appropriately small print.

The benchmark is Humanity’s Last Exam, a difficult collection of expert-level questions across scientific and other knowledge domains. Because DeepSeek-R1-0528 is not multimodal in the authors’ setup, they evaluate only the text-only subset: 2,518 samples. The workflow is run three times, and the paper reports the average score. Evaluation follows the official-style setup using o3-mini as judge.

The headline result is 32.1%. That number is useful, but not self-explanatory. The real story is how the system got there.

X-Master gives the model hands, not just thoughts

A normal reasoning model can think inside its context window. X-Master tries to make the model interact with the outside world while it thinks.

The paper’s central mechanism is “code as interaction language.” During reasoning, the model can generate Python code between special markers. The system detects that code, executes it in a sandbox, and appends the execution result back into the model’s context. The model then continues reasoning with the new information.
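A minimal sketch of that loop, under stated assumptions: the `<code>` and `<result>` markers, the `interact` helper, and the `toy_model` stand-in are all hypothetical names chosen for illustration, and `exec` here is not a real sandbox. The paper's actual tags and execution environment may differ.

```python
import contextlib
import io
import re

# Hypothetical markers; the paper's actual tag names may differ.
CODE_PATTERN = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_code(source: str) -> str:
    """Execute code and capture stdout. NOT a real sandbox: illustration only."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})  # a production system would isolate this
    except Exception as exc:  # feed errors back so the model can recover
        return f"Error: {exc!r}"
    return buffer.getvalue()


def interact(generate, prompt: str, max_rounds: int = 5) -> str:
    """Alternate between model generation and code execution."""
    context = prompt
    for _ in range(max_rounds):
        chunk = generate(context)
        context += chunk
        match = CODE_PATTERN.search(chunk)
        if match is None:  # no tool call: the model has finished
            break
        result = run_code(match.group(1))
        context += f"\n<result>\n{result}</result>\n"
    return context


def toy_model(context: str) -> str:
    """Stand-in for a reasoning model: asks for one computation, then answers."""
    if "<result>" not in context:
        return "I should compute this. <code>print(2 ** 10)</code>"
    return "The value is 1024."


transcript = interact(toy_model, "Q: What is 2 to the power of 10?\n")
```

The point of the sketch is the control flow: generation pauses when code appears, the execution output (or the error) is appended to the context, and generation resumes with that evidence in view.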

This matters because tool use becomes more general than a fixed menu of commands. A model can use Python libraries, call custom web search and parsing tools, compute values, and potentially build small helper functions during inference. The interface is not “choose from three buttons.” It is closer to “write the procedure you need.”

The paper emphasises three advantages of this choice:

Design choice	Technical consequence	Business translation
Python as the tool interface	The model can express precise interaction steps in executable form	Teams can connect agents to existing analytics, search, parsing, and internal systems without inventing a bespoke language for every tool
Tool results reinserted into context	Reasoning and evidence gathering happen in a loop	The agent can revise its answer after seeing external data rather than hallucinating politely into the carpet
Iterative invocation	The model can search, parse, compute, and verify across multiple steps	Complex knowledge work becomes a workflow, not a single prompt

The paper’s custom tools are deliberately modest: web search and web parse. Search retrieves relevant pages, snippets, related queries, and entity-style facts. Parse extracts relevant material from general web pages and scientific papers, with a fallback from HTML to PDF when needed.
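The HTML-to-PDF fallback can be pictured as a short chain of parsers. This is a sketch only: `parse_with_fallback`, `fake_html`, and `fake_pdf` are hypothetical names, and the stand-in parsers return canned strings rather than fetching anything.

```python
from typing import Callable, Optional

# A parser takes a URL and returns extracted text, or None when it cannot
# handle the page.
Parser = Callable[[str], Optional[str]]


def parse_with_fallback(url: str, parse_html: Parser, parse_pdf: Parser) -> str:
    """Try the HTML parser first; fall back to the PDF parser if it yields nothing."""
    for parser in (parse_html, parse_pdf):
        try:
            text = parser(url)
        except Exception:
            text = None  # treat a crash like an empty result and move on
        if text:
            return text
    return "PARSE_FAILED: no readable content at " + url


# Hypothetical stand-ins for real fetch-and-extract code.
def fake_html(url: str) -> Optional[str]:
    return None if url.endswith(".pdf") else "html text"


def fake_pdf(url: str) -> Optional[str]:
    return "pdf text" if url.endswith(".pdf") else None
```

Returning an explicit failure string, rather than raising, keeps the agent loop alive so the model can try a different source.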

That modesty is important. X-Master is not winning by stuffing the agent with hundreds of specialist tools. At least in the paper’s reported setup, it is winning by giving a strong reasoner a flexible interaction protocol and then forcing the answer through a multi-stage workflow.

The small trick that makes a non-agent act agentic

There is a wonderfully low-tech move in the paper: Initial Reasoning Guidance.

The authors argue that strong reasoning models such as DeepSeek-R1 are not naturally agentic and may not reliably follow normal prompts telling them to use tools. So instead of waiting for the model to begin its unconstrained reasoning, the system inserts first-person guidance immediately after the model’s initial thinking marker. The guidance says, in effect: I can use external tools; when I need interaction, I will generate Python code between the right tags.

It is not fine-tuning. It is not reinforcement learning. It is context steering.
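In code, context steering of this kind amounts to prefilling the start of the model's reasoning. The sketch below assumes a `<think>` marker, a simple `User:`/`Assistant:` transcript layout, and guidance wording of my own; none of these strings are copied from the paper.

```python
# Assumptions: the "<think>" marker, the transcript layout, and the guidance
# wording are illustrative, not taken from the paper.
THINK_MARKER = "<think>"
GUIDANCE = (
    "I can interact with external tools. When I need to search, parse, or "
    "compute, I will write Python code between <code> and </code> and wait "
    "for the result before continuing.\n"
)


def build_prefilled_prompt(question: str) -> str:
    """Return a prompt whose reasoning section already starts with the guidance.

    The model is asked to continue the text, so its first "thought" is the
    guidance itself rather than an instruction it may ignore.
    """
    return f"User: {question}\nAssistant: {THINK_MARKER}\n{GUIDANCE}"


prompt = build_prefilled_prompt("What is the melting point of gallium?")
```

The design choice is placement: the guidance sits after the thinking marker, in the first person, so the model treats it as its own opening thought.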

This is either elegant or faintly absurd, depending on one’s tolerance for LLM psychology theatre. The useful point is that the model is nudged into a role before it starts reasoning. The system does not merely ask for a final answer. It shapes the internal procedure by which the answer is produced.

For operators, this is a reminder that agent design is not just tool integration. It is behavioural scaffolding. A capable model with poorly framed tool access may ignore its tools, overuse them, call them at the wrong time, or fail to incorporate results. X-Master’s first-person guidance is a cheap way to push the model toward a more useful execution pattern.

Cheap does not mean trivial. In agent systems, “the model knows the tool exists” and “the model uses the tool at the right moment” are different worlds. Many production failures live in that gap, wearing a small hat and calling themselves prompt engineering.

X-Masters turns one agent into a miniature review process

X-Master is the single agent. X-Masters is the workflow wrapped around it.

The workflow is scattered-and-stacked. Scattering creates breadth: multiple solver agents generate different candidate answers in parallel. Stacking creates depth: critics, rewriters, and selectors refine and consolidate those candidates.

The paper frames this as analogous to rollouts in reinforcement learning: explore multiple trajectories, then exploit the best evidence and reasoning paths. That analogy is useful as a mental model, though businesses do not need the RL terminology. They need the operational version:

1. Generate several plausible answers.
2. Critique each answer.
3. Rewrite by looking across the candidate pool.
4. Select the final answer.

The roles are simple but powerful.

Stage	Role	Supporting evidence in the paper	What it adds
Solver	Produces five initial tool-augmented answers	Main mechanism and first ablation step	Breadth, search diversity, independent reasoning paths
Critic	Reviews and corrects each candidate	Ablation evidence	Error detection and local repair
Rewriter	Synthesises the corrected candidates into new candidates	Ablation evidence and distributional analysis	Cross-candidate integration and quality amplification
Selector	Chooses the final answer	Final stage evidence	Last-mile adjudication among refined outputs
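The four roles in the table can be sketched as plain function composition. Everything here is hypothetical scaffolding: `Agent` is any prompt-in, text-out callable, the prompt templates are invented, and running five rewriter calls is an assumption based on the article's description of five candidates after rewriting. A real system would run the scattered calls in parallel.

```python
from typing import Callable, List

Agent = Callable[[str], str]  # hypothetical: prompt in, text out


def x_masters(question: str, solver: Agent, critic: Agent,
              rewriter: Agent, selector: Agent, n: int = 5) -> str:
    # Scatter: n independent solver attempts.
    drafts: List[str] = [solver(question) for _ in range(n)]
    # Stack 1: each draft is critiqued and corrected on its own.
    corrected = [critic(f"{question}\n\nDraft:\n{d}") for d in drafts]
    # Stack 2: each rewriter sees the whole corrected pool.
    pool = "\n---\n".join(corrected)
    rewritten = [rewriter(f"{question}\n\nCandidates:\n{pool}") for _ in range(n)]
    # Stack 3: one selector picks the final answer.
    final_pool = "\n---\n".join(rewritten)
    return selector(f"{question}\n\nCandidates:\n{final_pool}")


# Stub agents that only count how often each role is invoked.
calls = {"solver": 0, "critic": 0, "rewriter": 0, "selector": 0}


def stub(role: str) -> Agent:
    def agent(prompt: str) -> str:
        calls[role] += 1
        return f"{role} output {calls[role]}"
    return agent


answer = x_masters("toy question", stub("solver"), stub("critic"),
                   stub("rewriter"), stub("selector"))
```

With the stubs, one question triggers five solver, five critic, five rewriter, and one selector call, which is a useful number to keep in mind when the bill arrives.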

This is not conceptually exotic. It resembles how a decent analyst team works when the problem is hard: separate first drafts, internal review, synthesis, final editor. The difference is that the process happens at inference time and can be repeated across thousands of benchmark questions.

The business implication is straightforward: the “agent” is not the model alone. The agent is the model plus workflow. If procurement teams evaluate only the model endpoint, they are measuring the engine without the transmission, steering, brakes, and occasional driver intervention. A bold automotive strategy, if one enjoys ditches.

The evidence shows staged gains, not benchmark magic

The strongest part of the paper is not the headline leaderboard. It is the ablation table showing how the score accumulates across stages.

On the text-only Humanity’s Last Exam subset, the paper reports:

Configuration	Accuracy
DeepSeek-R1-0528 without tools	17.7%
Solver with tool-augmented reasoning	21.1%
Solver + Critic	25.0%
Solver + Critic + Rewriter	30.6%
Full X-Masters with Selector	32.1%

This progression matters because it makes the result interpretable. Tool use alone raises the average first-attempt score from 17.7% to 21.1%. That is useful but not enough to explain the record. The bigger jump comes from refinement: critique and rewriting lift performance to 30.6%. Final selection adds the last step to 32.1%.
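The per-stage arithmetic is easy to check. The snippet below uses only the accuracies from the table above; the variable names are mine.

```python
# Accuracies (percent) as reported in the ablation table above.
stages = [
    ("DeepSeek-R1-0528 without tools", 17.7),
    ("Solver with tool-augmented reasoning", 21.1),
    ("Solver + Critic", 25.0),
    ("Solver + Critic + Rewriter", 30.6),
    ("Full X-Masters with Selector", 32.1),
]

# Gain in percentage points contributed by each added stage.
gains = {}
for (_, prev), (name, score) in zip(stages, stages[1:]):
    gains[name] = round(score - prev, 1)

total_gain = round(stages[-1][1] - stages[0][1], 1)

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
print(f"Total: +{total_gain} points")
```

On these figures, tools add 3.4 points, the critic 3.9, the rewriter 5.6, and the selector 1.5. Critic and rewriter together account for 9.5 of the 14.4 points gained over the no-tools baseline.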

So the paper’s evidence does not say, “tools solve hard science questions.” It says something more operationally useful: tools help, but the larger gains come from the structured multi-pass stages layered on top of them: critique, rewriting, and selection.

The paper also reports a scattering-versus-stacking ablation:

Workflow variant	Accuracy	Interpretation
No scattering, stacking retained	25.5%	A single path with refinement misses the benefit of broad exploration
Scattering retained, no stacking	25.0%	Multiple answers without synthesis lose the depth advantage
Scattering + stacking	32.1%	Breadth and refinement appear complementary

This is the cleanest operational message in the paper. Parallel exploration is not enough. Sequential refinement is not enough. The gain comes from combining both.

The authors also analyse rewriting by counting, for each question, how many of the five candidate solutions are correct before and after the Rewriter stage. The reported pattern is that rewriting increases the frequency of higher correctness counts, especially cases where all five candidates are correct. This is not a separate thesis; it is a supporting analysis for the Rewriter stage. Its purpose is to explain why final selection becomes easier: the selector receives a better candidate pool.

That is a practical insight. Selection quality depends on candidate quality. A final judge cannot reliably rescue a pool of mostly bad answers. Rewriting improves the pool before the selector acts.
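The shape of that candidate-pool analysis can be sketched in a few lines. The correctness flags below are made up purely to show the computation; they are not data from the paper, and `correct_count_distribution` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List


def correct_count_distribution(flags_per_question: List[List[bool]]) -> Dict[int, int]:
    """Map k -> number of questions where exactly k candidates are correct."""
    counts = Counter(sum(flags) for flags in flags_per_question)
    n = len(flags_per_question[0]) if flags_per_question else 0
    return {k: counts.get(k, 0) for k in range(n + 1)}


# Made-up flags for three questions, purely to show the shape of the analysis.
before = [[True, False, False, False, False],
          [True, True, False, True, False],
          [False, False, False, False, False]]
after = [[True, True, True, True, True],
         [True, True, True, True, False],
         [False, False, False, False, False]]

dist_before = correct_count_distribution(before)
dist_after = correct_count_distribution(after)
```

In this toy case the mass shifts from one and three correct candidates toward four and five, while the question nobody solved stays unsolved. A selector facing the second pool has an easier job on the first two questions and no better odds on the third.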

Biology is a stress test, not a coronation

The biology results are interesting because they test whether the general workflow transfers into a specialised scientific domain.

On the Biology/Medicine category of HLE, X-Masters reports 27.6% on the complete 222 text-only questions. The paper compares this with Biomni at 17.3% on 52 selected samples and STELLA at approximately 26% on 50 selected samples. The comparison is suggestive, but the sample difference matters. X-Masters is evaluated on the full text-only category, while Biomni and STELLA are reported on selected subsets. That makes the result encouraging, not perfectly symmetrical.

The TRQA-lit result is cleaner as a domain benchmark comparison. TRQA-lit is a 172-question multiple-choice benchmark for biological research tasks such as therapeutic target identification and biomedical mechanisms. The paper reports:

System	TRQA-lit accuracy
Gemini 2.5 Pro	52.9%
DeepSeek-R1	54.8%
o3-mini	57.8%
OriGene	60.1%
X-Master	62.1%
X-Masters	67.4%

The authors note that OriGene is a multi-agent system with more than 500 expert tools, while X-Master in this setup uses only two web tools. This is a useful result because it cuts against a common assumption: that domain agents mainly improve by accumulating ever-larger tool inventories.

The more precise interpretation is narrower. On this benchmark, a general tool-augmented workflow with strong reasoning and good orchestration beats more specialised alternatives. It does not prove that two generic tools are enough for actual biomedical research. Real research workflows need databases, wet-lab constraints, causal evidence, experimental design, regulatory context, and expert review. The benchmark tests difficult knowledge reasoning, not the full lifecycle of discovery.

Still, the result should make tool maximalists slightly uncomfortable. Tool count is not architecture. A warehouse full of instruments is not a lab if nobody knows when to use which one.

What the paper directly shows, and what operators can infer

The cleanest business interpretation is not “build X-Master exactly.” It is “capability can be engineered around the model.”

Paper result	What it directly shows	Cognaptus inference for business use	Boundary
X-Masters reaches 32.1% on text-only HLE	A tool-augmented, multi-stage open-source workflow can outperform cited closed research products on this benchmark	Some organisations can gain capability through workflow engineering before custom training	Benchmark performance does not equal production reliability
Tool use improves 17.7% to 21.1%	External search/parse/compute access improves first-pass accuracy	Connect agents to reliable enterprise data and tools early	Bad tools, stale data, or weak parsing will contaminate outputs
Critic and Rewriter lift performance to 30.6%	Multi-pass refinement drives major gains	Add review, repair, synthesis, and verification stages for high-stakes tasks	Latency and cost rise with every extra pass
Scattering + stacking beats either alone	Breadth and depth reinforce each other	Use candidate generation plus consolidation for ambiguous analytical work	Not every task justifies five solvers and multiple reviewers
TRQA-lit improves to 67.4%	The workflow transfers to a biology research QA benchmark	General agent orchestration may compete with heavily specialised setups in some domains	It does not replace domain-specific evidence pipelines

For enterprise teams, the most immediate use cases are not autonomous scientific discovery. They are research-heavy workflows where the answer depends on retrieval, calculation, comparison, and error checking:

technical due diligence;
scientific and patent literature review;
regulatory or standards analysis;
biomedical target landscaping;
engineering troubleshooting;
investment research with source verification;
internal knowledge-base QA where stale model memory is dangerous.

The pattern is the same: let the model reason, but do not leave it alone in its own head. Give it tools, force it to gather evidence, create multiple attempts, critique those attempts, synthesise, and select.

This is less glamorous than claiming an AI scientist has arrived. It is also more useful.

The hidden cost is inference-time compute

The paper is explicit that it bypasses extensive training. That does not mean the method is cheap in production.

X-Masters spends compute at inference time. Five solver attempts are generated. Critics refine them. Rewriters synthesise them. A selector chooses the final answer. The Solver also accesses external tools an average of three times per query while generating initial solutions. That is a lot more expensive than a single model call.
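A rough call count makes the trade-off concrete. The sketch assumes one critic pass per solver draft, five rewriter calls, and that the average of three tool calls applies to each solver attempt; those are my readings, not figures the paper states in this form, and `WorkflowBudget` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class WorkflowBudget:
    solvers: int = 5
    critics: int = 5        # assumption: one critic pass per solver draft
    rewriters: int = 5      # assumption: one rewriter per output candidate
    selectors: int = 1
    tool_calls_per_solver: float = 3.0  # assumption: average applies per attempt

    @property
    def model_calls(self) -> int:
        return self.solvers + self.critics + self.rewriters + self.selectors

    @property
    def tool_calls(self) -> float:
        # Only solver tool calls are counted; later stages may add more.
        return self.solvers * self.tool_calls_per_solver


budget = WorkflowBudget()
print(budget.model_calls, budget.tool_calls)
```

Under these assumptions a single question costs 16 model calls and about 15 tool calls, before counting token length or retries.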

This is not a flaw; it is a trade-off. The workflow converts training cost into runtime cost. For hard questions, that may be a very good deal. For routine customer support, it may be a small bonfire made of margin.

Operators should therefore segment tasks by value and risk:

Task type	Recommended pattern
Low-value, low-risk, repetitive queries	Single model call, light retrieval, minimal review
Medium-value analytical tasks	Tool use plus one critique or verification pass
High-value technical or scientific questions	Scattered generation, critique, rewrite, selection, source logging
Regulated or safety-critical decisions	Agent output as analyst support only, with human expert approval
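The segmentation in the table can be encoded as a small routing rule. The tier labels (`low`, `medium`, `high`), the `regulated` flag, and the `choose_pattern` function are hypothetical; the pattern descriptions are taken from the table.

```python
from enum import Enum


class Pattern(Enum):
    SINGLE_CALL = "single model call, light retrieval, minimal review"
    TOOLS_PLUS_REVIEW = "tool use plus one critique or verification pass"
    FULL_WORKFLOW = "scattered generation, critique, rewrite, selection, source logging"
    HUMAN_APPROVAL = "agent output as analyst support only, with human expert approval"


def choose_pattern(value: str, regulated: bool = False) -> Pattern:
    """Pick a workflow tier. `value` is 'low', 'medium', or 'high' (hypothetical labels)."""
    if regulated:
        # Regulated or safety-critical work always goes to a human expert.
        return Pattern.HUMAN_APPROVAL
    tiers = {
        "low": Pattern.SINGLE_CALL,
        "medium": Pattern.TOOLS_PLUS_REVIEW,
        "high": Pattern.FULL_WORKFLOW,
    }
    try:
        return tiers[value]
    except KeyError:
        raise ValueError(f"unknown value tier: {value!r}") from None
```

For example, `choose_pattern("high")` selects the full scattered-and-stacked workflow, while `choose_pattern("low", regulated=True)` still routes to human approval.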

The paper does not report a full cost-latency analysis. That missing piece matters for deployment. A 32.1% benchmark score may be impressive, but a production agent must also answer within budget, within acceptable latency, and with auditable tool traces. The architecture points in a useful direction; it does not hand you a procurement spreadsheet.

Where the result stops

The limitations are not generic “more research is needed” wallpaper. They directly affect interpretation.

First, the HLE evaluation is text-only. That excludes multimodal questions, which matter in many scientific and engineering settings involving diagrams, microscopy images, plots, scans, equations embedded in figures, and lab outputs.

Second, the system depends on automated judging. The paper follows the official-style setup using o3-mini as judge, but judge choice always introduces some evaluation dependence. That does not invalidate the comparison, but it reminds us that leaderboard numbers are not laboratory measurements delivered by Moses.

Third, the baseline comparisons are partly taken from existing leaderboards. That is reasonable for a technical report, but it means not every competing system is necessarily run under an identical local setup controlled by the authors.

Fourth, tool reliability is a real boundary. The case studies are actually useful here. One example shows X-Master recovering when a parser fails; another shows it changing search strategy when results are irrelevant; another uses computation to test physical consistency. These cases are implementation evidence, not main benchmark evidence. Their purpose is to illustrate behaviour under tool friction. They also reveal the operational risk: agents must handle failed retrieval, irrelevant pages, contradictory sources, and numerical mistakes without collapsing into confident nonsense.

Fifth, 32.1% is still 32.1%. It leads the cited comparison, but most questions remain unsolved. The benchmark is hard, yes. Still, businesses should not translate “state of the art” into “ready for unsupervised scientific decision-making.” That would be less strategy than interpretive gymnastics.

The broader lesson: agent design is becoming the product layer

X-Master points to a shift that matters beyond this paper. As base models become stronger and more widely available, differentiation increasingly moves into the system layer: tool interfaces, memory, retrieval, verification, orchestration, evaluation, and role design.

That is uncomfortable for teams hoping that model choice alone will solve their AI roadmap. It will not. The paper’s ablations show that the same underlying reasoner behaves very differently depending on the workflow around it. A static DeepSeek-R1-0528 baseline scores 17.7%; the full X-Masters workflow reaches 32.1%. The gap is not magic. It is architecture.

For businesses, this suggests a practical sequence:

1. Start with a strong available reasoning model.
2. Add high-quality tool access to trusted data and computation.
3. Make tool use part of the reasoning loop, not an afterthought.
4. Generate multiple candidate answers for hard tasks.
5. Critique and rewrite before final selection.
6. Evaluate stage by stage, not only at the final answer.
7. Reserve full scattered-and-stacked workflows for questions worth the runtime cost.

The important word is “stage.” X-Master is valuable because the paper decomposes performance improvement. Tool use contributes. Critique contributes. Rewriting contributes. Selection contributes. Scattering and stacking each matter, but neither is sufficient alone.

This is how agent engineering becomes less mystical. You stop asking whether the model is “smart enough” in the abstract and start asking which part of the workflow fails: retrieval, calculation, critique, synthesis, or final judgement.

Conclusion: scientific agents will be built as systems, not summoned as models

X-Master does not prove that open-source AI has “passed” Humanity’s Last Exam in the ordinary sense. A 32.1% score is not a graduation ceremony. But it does show that open-source, tool-augmented, inference-time orchestration can compete with closed research systems on a demanding benchmark.

That is the useful signal.

The paper’s main contribution is not a dazzling new algorithm. It is know-how: how to turn a non-agentic reasoning model into a tool-using agent, then compound its output through parallel exploration, critique, rewriting, and selection. This is exactly the kind of progress that matters for practical AI systems because it is implementable, inspectable, and improvable.

The next generation of scientific AI will not be defined only by the biggest model. It will be defined by the quality of the loop around the model: what it can inspect, what it can compute, how it corrects itself, how it compares alternatives, and when it admits that the evidence is not enough.

A model answers. An agent investigates. X-Master is one more sign that the product frontier is moving from answer generation to managed investigation. About time.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xinyu Zhu, Mengcheng Zhou, Yanfeng Wang, Weinan E, Yuzhi Zhang, Linfeng Zhang, and Siheng Chen, “SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation — Can We Lead on Humanity’s Last Exam?”, arXiv:2507.05241, 2025, https://arxiv.org/abs/2507.05241.
