Cover image

Passing Humanity's Last Exam: X-Master and the Emergence of Scientific AI Agents

Is it possible to train a language model to become a capable scientist? That provocative question lies at the heart of a new milestone in AI research. In SciMaster: Towards General-Purpose Scientific AI Agents, a team from Shanghai Jiao Tong University introduces X-Master, a tool-augmented open-source agent that has just achieved the highest score ever recorded on Humanity’s Last Exam (HLE)—surpassing even OpenAI and Google. But what makes this feat more than just a leaderboard update is how X-Master got there. Instead of training a larger model or fine-tuning on more data, the researchers innovated on agentic architecture and inference-time workflows. The result? An extensible framework that emulates the exploratory behavior of human scientists, not just their answers. ...

July 8, 2025 · 4 min · Zelina
Cover image

The Reasoning Gymnasium: How Zero-Sum Games Shape Smarter LLMs

If the future of reasoning in large language models (LLMs) doesn’t lie in human-tweaked datasets or carefully crafted benchmarks, where might it emerge? According to SPIRAL, a recent framework introduced by Bo Liu et al., the answer is clear: in games. SPIRAL (Self-Play on zero-sum games Incentivizes Reasoning via multi-Agent muLti-turn reinforcement learning) proposes that competitive, turn-based, two-player games can become a reasoning gymnasium for LLMs. It provides an automated and scalable path for cognitive skill acquisition, sidestepping human-curated data and rigid reward functions. ...

July 1, 2025 · 4 min · Zelina
Cover image

Proofs and Consequences: How Math Reveals What AI Still Doesn’t Know

What happens when we ask the smartest AI models to do something truly difficult—like solve a real math problem and prove their answer is correct? That’s the question tackled by a group of researchers in their paper “Mathematical Proof as a Litmus Test.” Instead of testing AI with casual tasks like summarizing news or answering trivia, they asked it to write formal mathematical proofs—the kind that leave no room for error. And the results? Surprisingly poor. ...

June 23, 2025 · 4 min · Zelina
Cover image

Thinking Inside the Gameboard: Evaluating LLM Reasoning Step-by-Step

LLMs are great at spitting out answers—but are they any good at thinking through problems? A new benchmark, AdvGameBench, introduces a process-based evaluation approach that places LLMs into three rule-based strategic games to measure not outcomes, but the quality of reasoning. Developed by Yuan et al., this framework focuses on how LLMs plan, revise, and make resource-limited decisions in dynamic settings. Three Games, Three Cognitive Demands 1. Tower Defense tests spatial planning and rule-following. Models place defenders on a battlefield to block enemies—positioning, cooldowns, and cost management are key. ...

June 20, 2025 · 3 min · Zelina
Cover image

Reflections in the Mirror Maze: Why LLM Reasoning Isn't Quite There Yet

In the quest for truly intelligent systems, reasoning has always stood as the ultimate benchmark. But a new paper titled “Towards a Deeper Understanding of Reasoning Capabilities in Large Language Models” by Annie Wong et al. delivers a sobering message: even the most advanced LLMs still stumble in dynamic, high-stakes environments when asked to reason, plan, and act with stability. Beyond the Benchmark Mirage Static benchmarks like math word problems or QA datasets have long given the illusion of emergent intelligence. Yet this paper dives into SmartPlay, a suite of interactive environments, to show that LLMs exhibit brittle reasoning when faced with real-time adaptation. SmartPlay is a collection of dynamic decision-making tasks designed to test planning, adaptation, and coordination under uncertainty. The team evaluates open-source models such as LLAMA3-8B, DEEPSEEK-R1-14B, and LLAMA3.3-70B on tasks involving spatial coordination, opponent modeling, and planning. The result? Larger models perform better—but only to a point. Strategic prompting can help smaller models, but also introduces volatility. ...

May 17, 2025 · 4 min

DeepSeek-R1

An open-source reasoning model achieving state-of-the-art performance in math, code, and logic tasks.

2 min