The Missing Metric: Measuring Agentic Potential Before It’s Too Late
In the modern AI landscape, models are not just talkers—they are becoming doers. They code, browse, research, and act within complex environments. Yet, while we’ve become adept at measuring what models know, we still lack a clear way to measure what they can become. APTBench, proposed by Tencent Youtu Lab and Shanghai Jiao Tong University, fills that gap: it’s the first benchmark designed to quantify a model’s agentic potential during pre-training—before costly fine-tuning or instruction-tuning stages even begin.
Why Agentic Potential Matters
Most existing pre-training benchmarks—MMLU, GSM8K, EvalPlus—evaluate static skills like knowledge recall, math reasoning, or code syntax. These tell us what the model knows, but not what it can do autonomously. When such base models are later fine-tuned into agents, their downstream performance in dynamic, multi-step environments (like SWE-Bench for coding or DeepResearch for information synthesis) varies wildly, even when they scored similarly on traditional benchmarks.
APTBench exposes this blind spot. The researchers show that base models with nearly identical MMLU or GSM8K scores differ by over 30 points when tested on real agentic tasks. In other words, today’s benchmarks predict textbook intelligence, not workplace performance.
Turning Trajectories into Tests
The core innovation of APTBench lies in its conversion pipeline: it transforms real-world agent trajectories into lightweight, model-readable evaluation items. Rather than asking a base model to perform a full multi-turn task (which it can’t yet do), APTBench distills those trajectories into two types of questions:
| Type | Format | Evaluates |
|---|---|---|
| Planning | Multiple-choice | Can the model choose the right next step in a complex sequence? |
| Action | Text completion | Can it produce the correct command or response for the next move? |
Each question is derived from real human or agent trajectories—covering domains like software engineering (environment setup, bug fixing) and deep research (multi-hop search, evidence synthesis). Incorrect answers aren’t random—they’re plausible but wrong, generated by systematically degrading the true trajectory (e.g., reordering steps, omitting key actions, or injecting subtle logic errors). This ensures that success reflects genuine procedural reasoning, not memorization.
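To make that conversion concrete, here is a minimal sketch of the idea in Python. The trajectory format, the degradation heuristics, and the example repository task are hypothetical stand-ins rather than the authors' actual pipeline; the point is only to show how a trajectory prefix, its true next step, and systematically degraded distractors become a multiple-choice planning item.

```python
# Minimal sketch of the trajectory-to-question idea. The trajectory format and
# the degradation heuristics below are hypothetical stand-ins, not the authors'
# actual pipeline.
import random

def degrade(trajectory, idx):
    """Build plausible-but-wrong next steps by perturbing the real trajectory."""
    steps = trajectory["steps"]
    true_step = steps[idx]
    candidates = [
        steps[idx + 1] if idx + 1 < len(steps) else steps[idx - 1],  # reordered step
        true_step.rsplit(" ", 1)[0],                                 # key detail omitted
        true_step.replace("install", "uninstall"),                   # subtle logic error
    ]
    return [c for c in candidates if c != true_step][:3]

def make_planning_item(trajectory, idx):
    """Turn a trajectory prefix into a multiple-choice 'choose the next step' item."""
    context = trajectory["steps"][:idx]
    answer = trajectory["steps"][idx]
    options = [answer] + degrade(trajectory, idx)
    random.shuffle(options)
    return {
        "task": trajectory["task"],
        "context": context,
        "options": options,
        "label": options.index(answer),
    }

# Example: a made-up environment-setup trajectory.
traj = {
    "task": "Set up the project and run its test suite",
    "steps": [
        "git clone https://github.com/example/project.git",
        "pip install -r requirements.txt",
        "pip install -e .",
        "pytest -x tests/",
    ],
}
print(make_planning_item(traj, idx=2))
```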
A Two-Domain Lens: Code and Research
APTBench consists of two complementary benchmarks:
- APTBench-SWE: Tests planning and action abilities in software engineering. Tasks include choosing environment setup steps, writing bash commands, or selecting bug fixes from real GitHub repositories. Atomic skills like error handling or bug localization are also evaluated.
- APTBench-DR: Tests agentic reasoning in deep research scenarios. Closed-ended questions check if a model can plan and answer through multi-hop search, while open-ended ones measure its ability to structure long-form reports and cite evidence correctly.
This dual structure captures both precise procedural execution (SWE) and strategic synthesis under uncertainty (DR)—the two cognitive pillars of effective AI agents.
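To see how both domains reduce to the same two lightweight formats, here is a hypothetical item from each, along with a deliberately naive scorer. The field names, prompts, and matching logic are invented for illustration and do not reflect the released APTBench schema.

```python
# Hypothetical illustration (not the released APTBench schema) of how both
# domains reduce to the same two question formats, so one scorer covers both.
swe_action_item = {
    "domain": "SWE",
    "format": "completion",
    "prompt": ("The test suite fails with ModuleNotFoundError: No module named 'yaml'. "
               "Write the single shell command that fixes the environment."),
    "reference": "pip install pyyaml",
}

dr_planning_item = {
    "domain": "DR",
    "format": "multiple_choice",
    "prompt": ("You must find the birth year of the architect of a bridge named after "
               "a Nobel laureate. What is the best next search step?"),
    "options": [
        "Search for the Nobel laureate's birth year",
        "Search for bridges named after Nobel laureates",  # correct: identify the bridge first
        "Search for famous architects born in the 1900s",
        "Search for the longest bridge in the world",
    ],
    "label": 1,
}

def score(item, model_output: str) -> float:
    """Naive scorer: option matching for planning items, exact match for action items."""
    if item["format"] == "multiple_choice":
        return float(model_output.strip() == item["options"][item["label"]])
    # A real harness would normalize commands or check them by execution.
    return float(model_output.strip() == item["reference"])

print(score(swe_action_item, "pip install pyyaml"))  # 1.0
```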
Emergence Before Fine-Tuning
APTBench reveals a striking phenomenon: agentic emergence happens at a size threshold. Models below roughly 4 billion parameters (like Qwen3-1.7B) fail almost completely at planning and action tasks. But once past that threshold, agentic abilities start to stabilize and scale predictably.
Even more surprising, medium-sized models (around 30–40B) like Seed-OSS-36B often match or outperform massive 100B+ models such as DeepSeek-V3.1 or Kimi-K2. The deciding factor isn’t architecture—it’s data. Models pre-trained on agent-oriented corpora consistently outperform larger peers trained on generic text. APTBench quantifies what many developers already suspected: pre-training data quality, not just scale, drives agent readiness.
Predicting the Future Agent
APTBench’s true power lies in prediction. When researchers plotted APTBench scores of base models against their fine-tuned versions’ performance on SWE-Bench Verified, the correlation coefficient reached 0.84—far higher than with traditional metrics. This means a model’s agentic potential can now be estimated before spending millions on post-training cycles.
When long-context tasks (which currently disadvantage models with limited sequence windows) were filtered out, the correlation grew even stronger. This not only validates APTBench’s sensitivity but also points to long-context reasoning as the next frontier for agentic pre-training.
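The validity check is simple to reproduce in spirit: pair each base model's APTBench score with its fine-tuned descendant's SWE-Bench Verified score and compute the correlation. The sketch below uses toy placeholder numbers, not the paper's data.

```python
# Sketch of the predictive-validity check: correlate base-model APTBench scores
# with their post-trained descendants' SWE-Bench Verified scores.
# The values below are toy placeholders, not the paper's data.
from statistics import correlation  # Pearson's r, Python 3.10+

aptbench_base  = [42.1, 55.3, 61.0, 48.7, 67.5]  # hypothetical base-model APTBench scores
swebench_tuned = [15.0, 22.5, 38.0, 27.4, 33.1]  # hypothetical fine-tuned SWE-Bench Verified scores

r = correlation(aptbench_base, swebench_tuned)
print(f"Pearson r = {r:.2f}")  # ~0.82 for these toy numbers; the paper reports 0.84 on real models
```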
The Strategic Implication: Pre-train Smarter, Not Longer
APTBench isn’t just an academic curiosity—it’s a practical tool for LLM developers. It allows pre-training teams to:
- Monitor agentic skill formation in real time rather than waiting until post-training (a minimal monitoring sketch follows this list).
- Compare architectures and data mixes based on agentic outcomes, not static benchmarks.
- Adjust trajectories toward long-context, feedback-driven learning before costly missteps.
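As a rough, self-contained illustration of that monitoring loop, the sketch below scores a couple of planning-style items at successive checkpoints; the dummy model, the toy items, and the checkpoint steps are all placeholders for a real training stack and the released benchmark.

```python
# Self-contained sketch of checkpoint monitoring with APTBench-style items.
# Everything here (dummy model, toy items, checkpoint steps) is a placeholder.
import random

def evaluate_suite(generate, items):
    """Mean accuracy over multiple-choice planning items."""
    correct = 0
    for item in items:
        choice = generate(item["prompt"], item["options"])  # index of chosen option
        correct += int(choice == item["label"])
    return correct / len(items)

def dummy_generate(prompt, options):
    # Stand-in for a base model ranking the options by likelihood.
    return random.randrange(len(options))

planning_items = [
    {"prompt": "Repo cloned; tests import the package itself. Next step?",
     "options": ["pytest -x", "pip install -e .", "rm -rf .git"], "label": 1},
    {"prompt": "Build fails: C compiler not found. Next step?",
     "options": ["apt-get install build-essential", "git push", "pytest -x"], "label": 0},
]

# Evaluate at successive (hypothetical) checkpoints and watch the trend:
# a curve that stays flat across data mixes is the cue to adjust the corpus
# before committing to post-training.
for step in (50_000, 100_000, 200_000):
    acc = evaluate_suite(dummy_generate, planning_items)
    print(f"step={step:>7}  planning_acc={acc:.2f}")
```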
In short, APTBench turns pre-training from blind navigation into instrumented exploration. It signals a broader shift: evaluation itself is becoming agentic. The benchmark doesn’t just measure—it adapts to the behaviors that define intelligent action.
Final Thought
In AI, foresight is everything. APTBench gives model developers that foresight—a way to measure not just what a model is, but what it could become. As agentic AI evolves from a frontier into a foundation, pre-training without such insight will feel increasingly like flying blind.
Cognaptus: Automate the Present, Incubate the Future.