The Missing Metric: Measuring Agentic Potential Before It’s Too Late

Procurement teams love a leaderboard. It is tidy, numeric, comparable, and therefore dangerously comforting. A model scores well on MMLU, looks respectable on GSM8K, passes a coding benchmark, and suddenly someone in a meeting says it is “agent-ready.” Lovely. By that logic, a person who passes a written driving test should be handed the keys to a forklift in a crowded warehouse.

The problem is not that traditional benchmarks are useless. They measure something real: knowledge recall, mathematical reasoning, code completion, and other isolated competencies. The problem is that agentic systems do not live inside isolated questions. They plan, act, observe feedback, revise their path, call tools, recover from errors, and continue doing all of this long after a single-turn benchmark has politely gone home.

A recent paper from Tencent Youtu Lab and Shanghai Jiao Tong University proposes APTBench, a benchmark designed to measure the agentic potential of base language models during pre-training.[1] That timing matters. The authors are not asking whether a finished instruction-tuned agent can solve a task. They are asking whether the base model already contains the ingredients that later post-training can turn into a capable agent.

That is the expensive question. Once pre-training is done, discovering that the model has poor agentic foundations is not a fun little surprise. It is the sort of surprise that arrives wearing a finance department badge.

The benchmark trick: turn agent trajectories into base-model questions

APTBench begins with a practical asymmetry. End-to-end agent benchmarks are built for post-trained models. They assume the model can follow complex instructions, run multi-turn workflows, call external tools, and manage interaction history. Base models, by definition, are not yet good at that interface.

So the authors do not force base models to behave like deployed agents. Instead, they convert successful agent trajectories into questions that base models can answer.

The mechanism is simple in outline and subtle in execution:

  1. Collect real-world tasks and successful trajectories.
  2. Identify the agentic ability being tested.
  3. Extract the correct next plan, action, or domain-specific decision from the trajectory.
  4. Convert it into either a multiple-choice question or a text-completion task.
  5. Generate plausible wrong answers by degrading the correct answer.
  6. Validate the resulting items.
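
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `Step` and `PlanQuestion` types, the function names, and the prompt layout are all hypothetical. The single degradation rule shown, reusing plans from later trajectory steps as distractors, is one of the strategies the article mentions for issue fixing. The validation step is reduced to dropping items that lack enough distinct distractors.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    plan: str          # what the agent decided to do next
    action: str        # the command or tool call it issued
    observation: str   # the feedback returned by the environment


@dataclass
class PlanQuestion:
    context: str
    options: List[str]
    answer_index: int


def future_step_distractors(trajectory: List[Step], k: int) -> List[str]:
    """Hypothetical degradation rule: plans from later steps become wrong answers."""
    correct = trajectory[k].plan
    later = [step.plan for step in trajectory[k + 1:]]
    # dict.fromkeys removes duplicates while preserving order.
    return [plan for plan in dict.fromkeys(later) if plan != correct]


def build_plan_question(task: str, trajectory: List[Step], k: int,
                        n_options: int = 4, seed: int = 0) -> Optional[PlanQuestion]:
    """Turn step k of a successful trajectory into a next-plan multiple-choice item."""
    correct = trajectory[k].plan
    distractors = future_step_distractors(trajectory, k)
    if len(distractors) < n_options - 1:
        return None  # crude validation: not enough plausible wrong answers
    rng = random.Random(seed)
    options = rng.sample(distractors, n_options - 1) + [correct]
    rng.shuffle(options)
    # The model sees the task plus everything that happened before step k.
    history = "\n".join(
        f"Plan: {s.plan}\nAction: {s.action}\nObservation: {s.observation}"
        for s in trajectory[:k]
    )
    context = f"Task: {task}\n{history}\nNext plan:"
    return PlanQuestion(context, options, options.index(correct))
```

Calling `build_plan_question(task, trajectory, k)` for each usable `k` yields one item per step. A real pipeline would add more degradation rules and a human or model-based validation pass.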

The important move is that APTBench preserves the structure of agent work without requiring the base model to run the whole agent loop. A model does not need to execute a full GitHub issue fix inside Docker. It can instead be asked: given the issue, the previous trajectory, and the observed environment feedback, what should the next plan or command be?

That is not a toy proxy. It is a measurement compromise. The benchmark takes the expensive, messy, multi-turn thing and distils it into a form that can be scored during pre-training.

Here is the basic mapping:

| Agent behaviour in the wild | APTBench probe | What it tests |
| --- | --- | --- |
| Breaking a task into steps | Planning question | Whether the model recognises the right decomposition or next step |
| Calling a terminal or search tool | Action completion | Whether the model can produce the correct command, tool call, or concise answer |
| Handling domain-specific friction | Atomic ability question | Whether the model can do the small but decisive skill a domain requires |
| Recovering from errors | Error-handling choice | Whether the model can distinguish a valid repair path from plausible nonsense |
| Citing evidence in a report | Citation-support question | Whether the model can connect claims to supporting source content |

This is the paper’s main contribution. APTBench is not merely “another benchmark,” that most renewable of academic resources. It is a conversion method: from trajectory to probe, from agent behaviour to base-model signal.

Why static benchmarks miss agent behaviour

The paper’s motivating evidence is a familiar but still useful slap in the face: general benchmark scores correlate weakly with downstream agent performance.

The authors compare base model scores on MMLU, EvalPlus, and GSM8K with the corresponding instruct models’ performance on SWE-bench Verified. The relationship is unimpressive: the reported Pearson correlations are $r = 0.38$ for MMLU, $r = 0.17$ for EvalPlus, and $r = -0.28$ for GSM8K. The paper also notes that six models clustered between 86 and 88 on MMLU differ by about 30 points on SWE-bench.
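
For readers who want to run the same check on their own model list, the Pearson coefficient is short enough to write by hand. The sketch below is a plain standard-library implementation. The function name is ours, and nothing here reproduces the paper's data.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-deviation and the two sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / math.sqrt(sxx * syy)
```

Pass base-model benchmark scores as `xs` and the matching instruct-model agent scores as `ys`, one entry per model. When the `xs` values are tightly clustered, as with the six models between 86 and 88 on MMLU, there is little spread left to explain a 30-point range in `ys`.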

This is not a strange result once you think about the task boundary. MMLU asks whether the model has broad knowledge. GSM8K asks whether it can solve grade-school mathematical word problems. EvalPlus asks whether it can generate correct code against test cases. An agent fixing software issues has to inspect a repository, infer what matters, choose the next action, interpret tool feedback, and avoid thrashing. Those are related to knowledge and coding, but they are not reducible to them.

The misconception APTBench attacks is therefore specific: a strong base model on general benchmarks is not automatically a strong foundation for agents. It may be verbally competent, mathematically tidy, and still bad at choosing the next move in a dynamic workflow. The difference is not intelligence versus stupidity. It is static skill versus situated control.

That matters for business because many evaluation processes still treat agent readiness as a derivative of generic model quality. “This model is high-scoring; therefore our agent will be strong” is not a strategy. It is a procurement ritual with better typography.

What APTBench actually contains

APTBench covers two domains: software engineering and deep research. These are sensible choices because both are commercially important and both expose the planning-action-feedback loop.

APTBench-SWE contains 3,727 questions across two software tasks: environment setup and issue fixing.

Environment setup uses 489 GitHub repositories linked to recent ICLR, NeurIPS, and ICML papers. The benchmark asks models to choose setup plans, produce next bash commands, and handle setup errors based on issue threads. This is mundane in the best possible way. Many automation failures do not happen because the model cannot philosophise about code. They happen because it installs the wrong dependency, runs a command too early, misses a setup note, or cannot interpret an error thread.

Issue fixing uses successful trajectories from SWE-Smith and seed data from SWE-bench Lite. It probes stepwise planning, next-command generation, bug localisation, fix-patch selection, and test-patch selection. The negative choices are not random distractions. They are often generated from future trajectory steps, failed patches, overlapping code snippets, or systematically corrupted solutions.

APTBench-DR contains 2,255 questions across closed-ended and open-ended research tasks. Closed-ended questions use InfoDeepSeek-style search and browsing trajectories in English and Chinese. The model must choose reasonable next plans or produce concise final answers from the trajectory. Open-ended questions draw from DeepResearch Bench and Researchy Questions. The benchmark probes report planning, report selection, and citation support.

The citation task is especially revealing. The model sees a report, a cited webpage, and statements from the report. It must identify which statements are actually supported by that webpage. This is not glamorous. It is also exactly where many “deep research” agents quietly turn into well-formatted rumour mills.
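
To make the citation task concrete, here is one way such an item could be scored. The article does not say how APTBench scores these items, so the exact-match and F1 rules below are our assumptions. The function name and the use of statement indices are also hypothetical.

```python
from typing import Dict, Set, Union


def citation_support_score(predicted: Set[int],
                           gold: Set[int]) -> Dict[str, Union[bool, float]]:
    """Compare the statements a model marks as supported with the gold set.

    Both arguments hold indices of report statements that the cited
    webpage is judged to support.
    """
    if not predicted and not gold:
        # Nothing is supported and the model said so: a perfect answer.
        return {"exact_match": True, "precision": 1.0, "recall": 1.0, "f1": 1.0}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "exact_match": predicted == gold,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Low precision is the rumour-mill failure: the model credits a source with claims it never made. Low recall means the model missed support that was actually there.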

A compressed view of the benchmark looks like this:

| Subset | Task family | Main evidence role | What it supports | What it does not prove |
| --- | --- | --- | --- | --- |
| APTBench-SWE EnvSetup | Repository setup, commands, error handling | Main benchmark construction | Base models can be probed for procedural software readiness | That the model can autonomously set up every repository end-to-end |
| APTBench-SWE IssueFix | Planning, commands, bug location, patches, tests | Main benchmark construction | Trajectory-derived probes can represent core coding-agent moves | That the base model can run a complete agent framework |
| APTBench-DR closed-ended QA | Search planning and concise answers | Main benchmark construction | Base models can be evaluated on search-trajectory reasoning | That they can independently browse reliably in deployment |
| APTBench-DR open-ended QA | Report planning, report selection, citation support | Main benchmark construction plus long-context stress | Base models can be tested on synthesis and grounding decisions | That multiple-choice report selection equals report generation quality |
| Removing long-context tasks | Correlation sensitivity test | Robustness/sensitivity evidence | Long-context ability confounds some APTBench scores | That long context is optional for real agents |

That last row is important. The paper does not merely report a benchmark score and bow. It also checks what happens when very long-context tasks are removed. That analysis is not a second thesis; it is a sensitivity test. It helps separate “agentic potential” from “can this model survive a 128K-token prompt without forgetting why it came here?”

The results: agentic ability appears, but not evenly

The experiments cover a range of open base models, from small dense models to large mixture-of-experts systems. The paper’s most interpretable size comparison is the Qwen3 series.

On APTBench-SWE, Qwen3-1.7B averages 24.27, while Qwen3-4B, Qwen3-8B, and Qwen3-30B-A3B score 38.75, 41.62, and 41.60 respectively. On APTBench-DR, the same sequence is 28.52, 40.50, 42.35, and 45.55.

The authors interpret this as evidence of a size threshold: below a certain scale, agentic capabilities do not reliably emerge; above it, the gains flatten. That interpretation is plausible within the tested models, but it should not be over-generalised into a universal law of parameter counts. The evidence is strongest as a comparative observation inside the tested families.

The more commercially interesting result is that scale alone is not the story. Seed-OSS-36B performs strongly: 49.93 on APTBench-SWE and 61.56 on APTBench-DR. It competes with, and sometimes beats, much larger models on several subtasks. DeepSeek-V3.1 leads APTBench-DR overall at 66.42, while Kimi K2 leads APTBench-SWE overall at 52.56, but the medium-sized Seed model is clearly not embarrassed by the giants.

The paper also compares similarly sized models and argues that training data alignment is critical. Among 3–4B dense models, Qwen3-4B outperforms SmolLM3-3B by 59.2% on SWE and 48.8% on DR. Among large MoE models, GLM-4.5-Air leads Llama4-Scout by 20.1% on SWE and 18.8% on DR. The authors attribute this gap mainly to whether pre-training data has been optimised for agent-centric scenarios.

This is where the paper becomes operationally useful. For builders, APTBench suggests that agent readiness is not simply bought by adding parameters. It is shaped by the pre-training mixture: long trajectories, tool-like behaviour, procedural structure, feedback-conditioned decision-making, and domain-specific evidence use.

The dull version is “data matters.” The sharper version is: agentic pre-training data matters before the model has learned to behave politely in a chat window.

APTBench predicts downstream agent performance better than the usual suspects

The paper’s strongest evidentiary move is the correlation analysis between base-model APTBench scores and instruct-model performance on SWE-bench Verified.

For APTBench-SWE, the reported correlation with SWE-bench Verified is $r = 0.69$ with $p = 0.057$, just short of the conventional 0.05 significance threshold. When long-context tasks are removed, the correlation rises to $r = 0.84$ with $p = 0.009$. For APTBench-DR, the corresponding values are $r = 0.78$ with $p = 0.023$, and $r = 0.87$ with $p = 0.005$ after removing long-context tasks.
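
A p-value attached to a correlation depends heavily on how many models are in the comparison, so it helps to see the arithmetic. The sketch below converts $r$ and a sample size $n$ into a two-sided p-value using the standard statistic $t = r\sqrt{(n-2)/(1-r^2)}$ with $n-2$ degrees of freedom. The tail area comes from numerically integrating the Student t density with only the standard library. The sample size is an assumption: the article does not state it, but $n = 8$ gives values close to the reported ones.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(r: float, n: int, steps: int = 2000) -> float:
    """Two-sided p-value for a Pearson r observed on n paired scores."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and |r| < 1")
    if steps % 2:
        raise ValueError("Simpson's rule needs an even number of steps")
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Simpson's rule for P(0 < T < t); the two tails are whatever remains.
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3
    return 1 - 2 * central


# n = 8 is an inferred sample size, not a figure stated in the article.
for r in (0.69, 0.84, 0.78, 0.87):
    print(f"r = {r:.2f}, n = 8 -> p = {two_sided_p(r, 8):.3f}")
```

With so few models, a correlation has to be large before it clears conventional significance, which is why $r = 0.69$ lands on the wrong side of 0.05 while $r = 0.84$ does not.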

That pattern has two readings.

The first is the headline: APTBench is more predictive of downstream agent performance than MMLU, EvalPlus, or GSM8K in this comparison. The benchmark seems to capture something closer to the latent capabilities that post-training later exposes.

The second is more nuanced: long-context handling is both part of agent work and a confound in measurement. Some APTBench items are very long, especially IssueFix planning/action and open-ended deep-research action/citation tasks. If a model struggles with long context, it may underperform not because it lacks agentic decision-making, but because it cannot reliably process the input window. Removing those items strengthens the correlation with SWE-bench Verified.

This does not mean long-context tasks are bad. Quite the opposite. Real agents often operate in long contexts. But for diagnosis, APTBench’s long-context sensitivity tells evaluators what kind of weakness they are seeing. A model may have poor procedural judgement. Or it may have adequate procedural judgement trapped behind brittle context processing. Those are different engineering problems.

Business teams should care about that distinction. One leads to better agentic pre-training data. The other leads to long-context training, retrieval design, memory architecture, or context compression. Same failure symptom, different repair budget.
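
A team can run the same split on its own evaluation logs. The sketch below assumes each scored item records its prompt length in tokens and whether the model answered correctly. The `ItemResult` type and the 32,000-token cut-off are hypothetical, so pick a threshold that matches your own context budget.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ItemResult:
    n_tokens: int   # prompt length of the benchmark item
    correct: bool   # whether the model got the item right


def accuracy(items: List[ItemResult]) -> Optional[float]:
    """Fraction of correct items, or None for an empty bucket."""
    if not items:
        return None
    return sum(item.correct for item in items) / len(items)


def long_context_report(results: List[ItemResult],
                        max_tokens: int = 32_000) -> Dict[str, Optional[float]]:
    """Split results by prompt length and report the accuracy gap."""
    short = [r for r in results if r.n_tokens <= max_tokens]
    long_items = [r for r in results if r.n_tokens > max_tokens]
    short_acc, long_acc = accuracy(short), accuracy(long_items)
    gap = None
    if short_acc is not None and long_acc is not None:
        gap = short_acc - long_acc
    return {"full": accuracy(results), "short_only": short_acc,
            "long_only": long_acc, "gap": gap}
```

A large positive `gap` points toward context handling as the bottleneck. A low `short_only` score points toward procedural judgement itself.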

The practical value is earlier diagnosis, not benchmark theatre

APTBench’s business relevance is not that companies now have one more number to paste into a slide deck. The world has enough decorative metrics. The value is earlier diagnosis.

For model developers, the path is straightforward. During pre-training or continual pre-training, APTBench-style probes can test whether a model is developing the planning, action, error-handling, and evidence-grounding capabilities needed for agent applications. If a data mix improves MMLU but worsens trajectory-derived planning, that is not a minor detail. It may mean the model is becoming better at exams and worse at work.
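
The "better at exams, worse at work" pattern is easy to watch for between checkpoints. The sketch below compares two score dictionaries and raises a flag when static benchmarks rise while trajectory-derived probes fall. The metric names, the one-flag rule, and the default tolerance are illustrative assumptions, not anything the paper prescribes.

```python
from statistics import mean
from typing import Dict, List, Union


def divergence_report(before: Dict[str, float], after: Dict[str, float],
                      static_keys: List[str], agentic_keys: List[str],
                      tolerance: float = 0.5) -> Dict[str, Union[float, bool]]:
    """Compare two checkpoints on static and trajectory-derived metrics."""
    static_delta = mean([after[k] - before[k] for k in static_keys])
    agentic_delta = mean([after[k] - before[k] for k in agentic_keys])
    return {
        "static_delta": static_delta,
        "agentic_delta": agentic_delta,
        # Flag only when both moves exceed the noise tolerance.
        "exam_up_work_down": static_delta > tolerance and agentic_delta < -tolerance,
    }
```

For example, `divergence_report(ckpt_a, ckpt_b, ["mmlu"], ["apt_plan", "apt_action"])` would compare a hypothetical pair of checkpoints. The tolerance should be set from the run-to-run noise of the metrics involved.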

For AI buyers, the lesson is slightly different. APTBench is not a plug-and-play procurement standard for every enterprise use case. It is a warning against treating generic leaderboards as evidence of agent readiness. If a vendor claims their model is suited for coding agents, research agents, or workflow automation, the evaluation should include trajectory-level probes, not just knowledge and coding scores.

For product teams building agent systems, APTBench also suggests a diagnostic template:

| Business question | APTBench-style evaluation analogue | Decision it informs |
| --- | --- | --- |
| Can this model choose the next useful step? | Stepwise planning from prior trajectory and feedback | Agent orchestration and planner selection |
| Can it call tools precisely? | Text completion for commands or structured actions | Tool-use reliability and guardrail design |
| Can it recover from routine operational errors? | Error-handling choices from repository issues | Automation resilience |
| Can it distinguish a correct patch from a plausible failed one? | FixPatch and TestPatch selection | Coding-agent model selection |
| Can it ground claims in source material? | Citation support recognition | Research-agent trust and review workflow |
| Is failure due to weak judgement or weak context handling? | Long-context sensitivity comparison | Architecture and data strategy |

This is not the same as proving ROI. A benchmark can tell you that a model has stronger foundations for a class of agent tasks. It cannot tell you whether your customer-support automation will reduce cost by 17%, whether your developers will accept the workflow, or whether your compliance team will faint upon seeing the logs. Those require deployment studies.

But as a pre-deployment and pre-training signal, APTBench is useful precisely because it is cheaper than discovering the defect after a full agent system has already been built around the wrong model.

The boundary: what the paper shows, and what it does not

The paper’s evidence is strongest for the claim that trajectory-derived probes are better aligned with downstream agent performance than conventional static benchmarks. It shows this across software engineering and deep research tasks, with detailed construction methods and correlations against SWE-bench Verified.

Several boundaries matter.

First, APTBench is built around two domains. Software engineering and deep research are important, but they are not the whole agent economy. A model that performs well here may not automatically excel at robotics control, enterprise resource planning, medical workflow assistance, sales operations, or financial analysis. The construction method may generalise more easily than the benchmark itself.

Second, many distractors are generated by LLMs and then validated. This is reasonable and scalable, but it means benchmark quality depends on the quality of degradation rules, generation prompts, filtering, and human validation. Bad distractors can turn a hard agentic judgement into pattern recognition. The paper is careful about this, but future benchmark users should remain awake. An underrated professional habit, frankly.

Third, the downstream validation leans heavily on SWE-bench Verified because there is no equally standardised deep-research agent benchmark for this purpose. The authors use SWE-bench as an accepted proxy for agent performance. That is sensible, but it means the correlation evidence is more directly anchored in software engineering than in all agentic behaviour.

Fourth, the benchmark evaluates base models mostly through multiple-choice and text-completion formats. This is the point: base models cannot yet run full agents. But it also means APTBench measures potential, not deployed competence. Post-training, scaffolding, tool design, retrieval, memory, and environment integration can still change outcomes.

Finally, the long-context finding cuts both ways. Removing long-context tasks increases correlation with SWE-bench Verified, but long context remains essential for many real agent workloads. Teams should not interpret the “without long context” score as the cleaner truth and discard the rest. They should treat the gap between full and reduced benchmarks as a diagnostic clue.

What changes for enterprise evaluation

APTBench points toward a more mature evaluation stack.

At the bottom layer are static capability tests: knowledge, math, coding, language, and reasoning. These remain useful. Above that should sit trajectory-derived potential tests: can the base model recognise good plans, actions, patches, citations, and recovery strategies inside real workflows? Above that comes full agent evaluation: can the post-trained model, inside a real scaffold, complete tasks end-to-end under operational constraints?
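
One way to picture the stack is as a sequence of gates ordered from cheapest to most expensive, where a model only reaches the next layer if it clears the previous one. The sketch below is a hypothetical harness. The layer names, the evaluator callables, and the thresholds are placeholders for whatever a team actually runs.

```python
from typing import Callable, List, Tuple

# (layer name, function that runs the evaluation, minimum score to continue)
Layer = Tuple[str, Callable[[], float], float]


def run_stack(layers: List[Layer]) -> List[Tuple[str, float, bool]]:
    """Run layers in order and stop at the first one that fails its threshold."""
    report = []
    for name, evaluate, threshold in layers:
        score = evaluate()
        passed = score >= threshold
        report.append((name, score, passed))
        if not passed:
            break  # skip the more expensive layers that follow
    return report
```

The ordering is the design choice. Static tests and trajectory-derived probes cost little compared with an end-to-end pilot, so a failure in the middle layer saves the budget the top layer would have spent.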

Most organisations currently overuse the bottom layer and underinvest in the middle. Then they jump to end-to-end pilots and act shocked when the agent fails in ways the initial benchmark never had a chance to detect.

APTBench makes the missing middle explicit. It is not a replacement for end-to-end evaluation. It is a filter before end-to-end evaluation becomes expensive. It asks whether the model has the procedural instincts worth post-training in the first place.

That is the strategic value: not “train longer,” but “measure earlier.” Not “choose the biggest model,” but “choose the model whose base competencies match the agent you intend to build.” Not “trust the leaderboard,” but “test the loop.”

Conclusion: the agent is already hiding in the base model

The cleanest insight from APTBench is that agent quality does not begin at deployment, or even at instruction tuning. It begins earlier, in the base model’s exposure to the structure of work: plans, actions, feedback, failures, repairs, and evidence.

Traditional benchmarks ask whether a model can answer. APTBench asks whether it can recognise the next move. That difference is small enough to fit inside a multiple-choice question and large enough to change a pre-training strategy.

For businesses, the message is direct. If you are building or buying agentic AI, stop treating general benchmark excellence as a passport into autonomy. It is a useful credential, not a work permit. The missing metric is agentic potential: the measurable ability of a base model to become a competent planner, actor, fixer, and evidence-user after post-training.

APTBench does not solve agent evaluation. It clarifies where the evaluation should begin. In a field that often discovers defects after the demo, that is already a respectable improvement.

Cognaptus: Automate the Present, Incubate the Future.


  1. Jiarui Qin, Yunjia Xi, Junjie Huang, Renting Rui, Di Yin, Weiwen Liu, Yong Yu, Weinan Zhang, and Xing Sun, “APTBench: Benchmarking Agentic Potential of Base LLMs During Pre-Training,” arXiv:2510.24397, 2025.