A procurement team does not buy an AI agent because it can recite the word “interoperability” with theatrical confidence. It buys the agent because the thing can use tools, collect data, combine results, and stop before it bankrupts the token budget.

That is the useful way to read MCP-AgentBench, a new benchmark for evaluating language agents inside the Model Context Protocol ecosystem.[1] The paper is not just another leaderboard with a fresh coat of protocol paint. Its more interesting result is harsher: MCP gives agents a common integration layer, but it does not make them competent tool users. Compatibility is plumbing. Competence is orchestration.

The evidence is mildly inconvenient, which is usually where the good part begins. Qwen3-235B-A22B achieves the highest overall score in the benchmark under ReAct, with an average pass rate of 64.7%. Claude 4 Sonnet is the best proprietary model under native tool calling, with 58.0%. Kimi K2 reaches 61.0% in tool-calling mode. Meanwhile, the same Qwen model that leads under ReAct drops to 40.2% under native tool calling, largely because it sometimes fails to issue tool calls when they are required. So much for the comforting idea that “native tool calling” is automatically the grown-up architecture.

The paper’s real contribution is not a single winner. It is a diagnostic frame for asking a better enterprise question: which model, under which orchestration style, for which class of tool workflow, at what token cost?

The result that should annoy simple leaderboards

MCP-AgentBench evaluates agents using 33 operational MCP servers and 188 distinct tools. The authors build 600 human-verified queries, evenly distributed across six interaction categories: single-server single-call, single-server parallel-call, single-server sequential-call, multi-server single-call, multi-server parallel-call, and multi-server sequential-call.

That design matters because most business workflows are not “call one API and smile”. A useful agent may need to choose the right server, call several tools independently, pass the output of one tool into another, and synthesize a final answer. Tool-use evaluation that ignores this structure is not exactly wrong. It is just overfitted to the toy aisle.
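
The six categories are just a two-by-three grid: server scope crossed with call pattern. A minimal sketch, assuming only the counts quoted above (600 queries, evenly split); the variable names are illustrative and not taken from the paper's code:

```python
from itertools import product

# The two axes described in the article.
SERVER_SCOPES = ("single-server", "multi-server")
CALL_PATTERNS = ("single-call", "parallel-call", "sequential-call")
TOTAL_QUERIES = 600  # figure quoted in the article

# Cross the axes to get the six interaction categories.
categories = [
    f"{scope} {pattern}"
    for scope, pattern in product(SERVER_SCOPES, CALL_PATTERNS)
]

# "Evenly distributed" means an equal share per category.
queries_per_category = TOTAL_QUERIES // len(categories)

for name in categories:
    print(f"{name}: {queries_per_category} queries")
```

The same grid is a convenient skeleton for an internal test suite: keep the axes, swap in your own queries.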

The main results expose three separable facts:

| Finding | Evidence in the paper | Business meaning | Boundary |
| --- | --- | --- | --- |
| Open models can outperform proprietary models in this setting | Qwen3-235B-A22B reaches 64.7% average pass rate under ReAct | Do not assume vendor tier predicts agent reliability | This is one benchmark, on stateless text-based MCP servers |
| Tool-calling mode is not universally superior | Qwen3 falls from 64.7% ReAct to 40.2% TC; Claude 4 rises from 49.2% ReAct to 58.0% TC | Evaluate model × orchestration, not model alone | Implementation details and parser compatibility can affect outcomes |
| Token cost changes the engineering answer | Claude 4 Sonnet uses about 140.3k tokens/query; Kimi K2 about 101.7k; o3-mini reaches 50.0% at about 36.5k | Best pass rate may not be best operating model | Token counts are benchmark-specific, not a universal price sheet |

This is why the paper is more useful as an operating manual than as a trophy ceremony. The headline is not “Qwen beats Claude” or “Claude beats GPT-4o”. The headline is that agent reliability depends on the interaction contract between the model and its tools. A leaderboard that collapses this into one model ranking is doing what dashboards often do: making executives feel informed while removing the variable that matters.

MCP-AgentBench tests workflows, not just function names

The benchmark starts with a practical curation problem. The authors initially identify 369 candidate MCP servers, then filter them down to 33 that are executable, stable, stateless, and primarily text-based. This filtering took three people seven days. That detail is not glamorous, which is precisely why it matters. Real tool environments are messy before the model even enters the room.

The selected servers cover domains such as utilities, news and trends, developer tools, maps and navigation, sports and gaming, finance and investment, travel and transportation, and search or web content. Across them, the benchmark exposes 188 tools. The goal is not to simulate every enterprise system. It is to create a controlled but operational testbed where agents must interact with real MCP-mediated tools rather than hand-crafted toy functions.

The six categories are the benchmark’s main intellectual structure:

| Category | What it tests | Enterprise analogue |
| --- | --- | --- |
| Single-server single-call | Can the agent identify and call one appropriate tool? | Fetch one exchange rate, weather value, ticket status, or account field |
| Single-server parallel-call | Can it make independent calls within one domain? | Compare several shipment statuses, calendar slots, or records |
| Single-server sequential-call | Can it use one result to drive the next call? | Search, filter, then retrieve details from the same system |
| Multi-server single-call | Can it choose the right server among many? | Route the task to finance, CRM, maps, HR, or support tooling |
| Multi-server parallel-call | Can it gather independent evidence across systems? | Pull pricing, availability, and location data in one workflow |
| Multi-server sequential-call | Can it chain cross-domain dependencies? | Use CRM data to trigger billing lookup, then produce a customer action plan |

That ladder is the business value. The paper gives enterprises a template for replacing vague agent pilots with acceptance tests. Instead of asking whether a model “supports MCP”, ask whether it passes the task class your operation actually needs.

MCP compatibility tells you the adapter fits. It does not tell you the agent knows when to use the adapter, how many times to use it, or whether to trust the tool output over its own parametric memory. Apparently, the socket being standardised does not magically make the electrician competent. Tragic.

The evidence is main result first, validation second

The paper’s experimental evidence is worth separating by purpose, because not every figure is doing the same job.

| Paper component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Table 1 main benchmark results | Main evidence | Relative model × orchestration performance across six MCP task categories | Universal model superiority across all agent environments |
| Figure 4 difficulty analysis | Main evidence / diagnostic analysis | Tasks generally become harder with multi-server scope and sequential dependency | That every model will degrade monotonically in every deployment |
| Figure 5 token consumption | Practical efficiency analysis | Pass rate must be interpreted with token cost | Full production cost including latency, retries, hosting, and vendor pricing |
| Human-vs-MCP-Eval agreement | Evaluation validation | LLM judge broadly aligns with human majority on sampled items | Perfect evaluator reliability across all domains |
| Appendix prompts and configuration | Implementation detail | ReAct and TC setups are inspectable and reproducible enough to interpret | That prompt choices are uniquely optimal |

This distinction matters. The main results tell us the benchmark differentiates agents. The validation result tells us the automated judge is plausible. The appendix tells us how the experiment was run. None of those should be inflated into a second thesis.

The judge, MCP-Eval, is outcome-oriented. It evaluates whether the final answer satisfies the user’s goal, not whether the agent followed the same trajectory as a reference solution. That is the correct bias for real workflows. A business user does not care whether the agent used Tool A then Tool B, if Tool C got the same verified answer with fewer steps. Process logs matter for audit and debugging; final outcomes matter for usefulness.

The authors use o3-mini-high as the LLM judge. On a sample of 60 items, MCP-Eval reaches 91.67% agreement with the human majority vote, with Cohen’s kappa of 0.734. Inter-rater reliability among three human experts reaches Fleiss’ kappa of 0.671, and full three-way agreement is 86.67%. This is not a licence to outsource all evaluation judgment forever. It is evidence that the judge is usable as a scalable benchmark instrument, with spot audits where decisions carry operational risk.
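
For teams running their own spot audits, the two agreement statistics are easy to compute by hand. A minimal sketch of raw agreement and Cohen's kappa between a judge and a human majority vote; the ten labels below are hypothetical and are not the paper's sample:

```python
from collections import Counter

def percent_agreement(a, b):
    """Share of items on which two raters give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each rater's marginal label frequencies.
    p_expected = sum(
        count_a[label] * count_b[label] for label in set(a) | set(b)
    ) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical audit sample: human majority vote vs. LLM judge.
human_majority = ["pass", "pass", "fail", "pass", "fail",
                  "fail", "pass", "pass", "fail", "pass"]
llm_judge = ["pass", "pass", "fail", "pass", "fail",
             "pass", "pass", "pass", "fail", "pass"]

agreement = percent_agreement(human_majority, llm_judge)
kappa = cohens_kappa(human_majority, llm_judge)
print(f"agreement={agreement:.2%} kappa={kappa:.3f}")
```

Kappa is the stricter number: it discounts the agreement two raters would reach by chance given how often each says "pass".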

The model is not the unit of deployment

One of the paper’s most useful corrections is simple: “model performance” is the wrong phrase. The deployable unit is model × orchestration mode.

Under ReAct, Qwen3-235B-A22B leads the full benchmark. ReAct gives the model an explicit loop of thought, action, observation, and final answer. This style can help when tasks require planning and adaptation, especially across complex tool chains. It also keeps the model visibly engaged in the decision of whether and how to call tools.
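
The loop itself is simple enough to sketch. The version below is illustrative only and is not the paper's harness: the "model" is a scripted stand-in, and the `fx_rate` tool and its return value are hypothetical stubs.

```python
def run_react(policy, tools, question, max_steps=5):
    """Minimal thought / action / observation loop with a step cap."""
    transcript = [("question", question)]
    for _ in range(max_steps):
        step = policy(transcript)  # the model decides the next move
        transcript.append(("thought", step["thought"]))
        if "final" in step:  # the model chose to answer
            transcript.append(("final", step["final"]))
            return step["final"], transcript
        tool_name, argument = step["action"]
        observation = tools[tool_name](argument)  # execute the tool call
        transcript.append(("action", f"{tool_name}({argument!r})"))
        transcript.append(("observation", observation))
    return None, transcript  # step budget exhausted without an answer

def scripted_policy(transcript):
    """Stand-in for an LLM: call the tool once, then answer from its output."""
    observations = [text for kind, text in transcript if kind == "observation"]
    if not observations:
        return {"thought": "I need the live rate.",
                "action": ("fx_rate", "USD/EUR")}
    return {"thought": "I have the rate.",
            "final": f"USD/EUR is {observations[-1]}"}

tools = {"fx_rate": lambda pair: "0.90"}  # hypothetical stub tool
answer, transcript = run_react(scripted_policy, tools, "What is USD/EUR?")
print(answer)
```

The point of the structure is that every turn forces an explicit choice between acting and answering, which is exactly the decision a model can skip when it terminates early.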

Under native tool calling, Claude 4 Sonnet becomes the strongest proprietary model. That suggests the model is better aligned with the structured tool-call interface used in the test. Kimi K2 also performs strongly in TC mode, reaching the highest TC average overall at 61.0%.

Then comes the delightful architectural slap: Qwen3-235B-A22B collapses in TC mode. The paper often traces this to failures to generate a tool call when one is needed, which end the run prematurely with an incorrect answer. This is not a small implementation footnote. It is the sort of failure that turns a demo into an incident ticket.

The business interpretation is blunt. Do not select an LLM for agentic workflows based on chat quality, general reasoning reputation, or a generic function-calling leaderboard. Test the exact interaction mode your product will use. If your stack relies on native tool calls, test native tool calls. If your stack relies on ReAct-style loops, test ReAct. If you use both, rank both.
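
Ranking by pair rather than by model is a small change in bookkeeping. A sketch using only the average pass rates quoted in this article (combinations not quoted here are simply absent); the helper names are illustrative:

```python
# Average pass rates (%) quoted in this article, keyed by (model, mode).
PASS_RATES = {
    ("Qwen3-235B-A22B", "ReAct"): 64.7,
    ("Qwen3-235B-A22B", "TC"): 40.2,
    ("Claude 4 Sonnet", "ReAct"): 49.2,
    ("Claude 4 Sonnet", "TC"): 58.0,
    ("Kimi K2", "TC"): 61.0,
    ("o3-mini", "TC"): 50.0,
}

def rank_for_mode(mode):
    """Models ranked by pass rate within one orchestration mode."""
    rows = [(model, rate) for (model, m), rate in PASS_RATES.items() if m == mode]
    return sorted(rows, key=lambda row: row[1], reverse=True)

def mode_gap(model):
    """TC minus ReAct, in percentage points, for a model with both numbers."""
    return PASS_RATES[(model, "TC")] - PASS_RATES[(model, "ReAct")]

for mode in ("ReAct", "TC"):
    print(mode, rank_for_mode(mode))
for model in ("Qwen3-235B-A22B", "Claude 4 Sonnet"):
    print(f"{model}: TC - ReAct = {mode_gap(model):+.1f} points")
```

The same model sits at the top of one ranking and the bottom of the other, which is the whole argument in two lists.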

An enterprise evaluation should therefore report results like this:

| Workflow class | Candidate orchestration | What to measure |
| --- | --- | --- |
| Simple deterministic lookup | Native tool calling | Tool-call issuance, argument accuracy, final answer correctness |
| Cross-system evidence gathering | ReAct or TC with parallel call support | Correct tool selection, coverage of all requested components |
| Multi-step dependency chain | ReAct, planner-controller, or carefully tuned TC | Step ordering, state carryover, omission rate |
| Cost-sensitive high-volume support | Lower-cost model with escalation | Pass rate per 1k tokens, retry yield, escalation frequency |
| Regulated or auditable workflow | Tool calling with strong logs, plus outcome judge | Traceability, evidence grounding, failure classification |

That table is less exciting than a single model ranking. It is also much closer to how systems survive production.

The benchmark reveals tool-use failure modes executives should actually care about

The authors identify four recurring error categories: misinterpretation of the query, refusal to use tools, omission of key information, and hallucination.

These are not abstract model sins. They map directly to operating risks.

Misinterpretation is the agent version of bad requirements intake. The user asks for a task with constraints; the model solves a nearby task because nearby is comfortable. In production, this argues for intent checks, structured task decomposition, and confirmation rules when ambiguity affects cost or compliance.

Refusal to use tools is more dangerous than it sounds. It includes the model defaulting to parametric knowledge when the task requires current, external, proprietary, or user-specific data. This is how agents produce plausible answers from stale memory while standing next to the API that would have told them the truth. Very on-brand for machines trained on the internet.
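
This failure is cheap to detect if the test author marks which queries genuinely need external data. A minimal sketch; the `AgentRun` record and its fields are hypothetical and are not part of MCP or the benchmark:

```python
from dataclasses import dataclass, field

@dataclass
class AgentRun:
    query: str
    requires_external_data: bool  # labelled by the test author, not inferred
    tool_calls: list = field(default_factory=list)
    final_answer: str = ""

def is_tool_refusal(run):
    """True when the task needed a tool and the agent never issued a call."""
    return run.requires_external_data and not run.tool_calls

def refusal_rate(runs):
    """Refusals as a share of the runs that required external data."""
    needing = [r for r in runs if r.requires_external_data]
    if not needing:
        return 0.0
    return sum(is_tool_refusal(r) for r in needing) / len(needing)

# Hypothetical logged runs.
runs = [
    AgentRun("Current USD/EUR rate?", True, ["fx_rate"], "0.90"),
    AgentRun("Open tickets for account 42?", True, [], "There are none."),
    AgentRun("Weather in Oslo now?", True, ["weather"], "4 C, cloudy"),
    AgentRun("Rephrase this sentence.", False, [], "Done."),
]
print(f"refusal rate: {refusal_rate(runs):.0%}")
```

A confident answer with an empty call log is the signature to alert on; the answer text alone will not give it away.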

Omission is the multi-step killer. The agent may call the right tools but fail to include all important results in the final answer. That failure is especially relevant in workflows such as compliance review, order resolution, logistics planning, and customer support escalation, where leaving out one field can change the decision.

Hallucination remains the familiar villain, but the MCP setting makes it more diagnosable. If the tool output does not contain a claim, the final answer should not present that claim as grounded. The engineering fix is not “please be more truthful”. It is evidence binding: every critical field in the final answer should be traceable to user input, tool output, or an allowed inference.
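
Evidence binding can start as a blunt check before it becomes a sophisticated one. A minimal sketch, assuming the final answer has already been parsed into named fields; the function, field names, and sample payloads are hypothetical, and substring matching is a deliberately crude stand-in for real provenance tracking:

```python
def ungrounded_fields(final_fields, user_input, tool_outputs,
                      allowed_inferences=frozenset()):
    """Return names of final-answer fields with no supporting evidence.

    A field counts as grounded if its value appears verbatim in the user
    input or in any tool output, or if it is explicitly whitelisted as an
    allowed inference.
    """
    evidence = " ".join([user_input, *tool_outputs]).lower()
    ungrounded = []
    for name, value in final_fields.items():
        if name in allowed_inferences:
            continue
        if str(value).lower() not in evidence:
            ungrounded.append(name)
    return ungrounded

# Hypothetical run: the agent reports a discount no tool ever returned.
final_fields = {"price": "129.00", "stock": "12", "discount": "15%"}
user_input = "Check price and stock for SKU-7"
tool_outputs = [
    '{"sku": "SKU-7", "price": "129.00"}',
    '{"sku": "SKU-7", "stock": "12"}',
]
print(ungrounded_fields(final_fields, user_input, tool_outputs))
```

Substring matching will wave through coincidences (a stock value of "12" also matches inside "129.00"), so a production version would bind each field to a specific tool response rather than to a pooled string.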

Token economics turn benchmark results into architecture decisions

The paper’s token-efficiency analysis is not a side note. It is the part finance will eventually discover, so engineering may as well get there first.

In TC mode, Kimi K2 achieves a 61.0% pass rate with roughly 101.7k tokens per query. Claude 4 Sonnet reaches 58.0% with roughly 140.3k tokens per query. o3-mini reaches 50.0% with about 36.5k tokens per query. GPT-4o and DeepSeek V3 consume tokens in a similar range to o3-mini but score lower in this benchmark.

The practical interpretation is not that o3-mini is “best”. It is that throughput architecture should not worship the maximum pass rate. A support operation processing thousands of low-risk requests may prefer a cheaper model with escalation. A financial compliance workflow with low tolerance for missed evidence may prefer a more expensive reasoning model. A multi-server sequential workflow may need a specialist mode. The benchmark gives the data structure for making that trade-off; it does not make the trade-off for you.
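
The trade-off becomes concrete once pass rate is divided by token spend. A short sketch using only the TC-mode figures quoted above; "pass-rate points per 1k tokens" is an illustrative metric chosen here, not one defined by the paper:

```python
# (average pass rate in %, average thousand tokens per query), TC mode,
# as quoted in this article.
TC_RESULTS = {
    "Kimi K2": (61.0, 101.7),
    "Claude 4 Sonnet": (58.0, 140.3),
    "o3-mini": (50.0, 36.5),
}

def points_per_1k_tokens(pass_rate, k_tokens):
    """Pass-rate percentage points bought per thousand tokens spent."""
    return pass_rate / k_tokens

efficiency = {
    model: points_per_1k_tokens(rate, k_tokens)
    for model, (rate, k_tokens) in TC_RESULTS.items()
}

for model, score in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {score:.2f} pass-rate points per 1k tokens")
# o3-mini: 1.37, Kimi K2: 0.60, Claude 4 Sonnet: 0.41
```

The ordering flips relative to raw pass rate, which is why the metric is a routing input rather than a verdict: a ratio says nothing about whether 50.0% is acceptable for the task at hand.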

A reasonable enterprise design would route by task class:

  1. Use a cost-efficient model for simple single-call and low-risk parallel tasks.
  2. Escalate to stronger reasoning models when the workflow becomes sequential, multi-server, or failure-sensitive.
  3. Cap token budgets and action counts.
  4. Log tool-call issuance, missing-field rates, and final-answer evidence coverage.
  5. Evaluate pass rate per cost, not pass rate in isolation.

That is not glamorous. It is architecture.
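
The routing rules above fit in a few lines. A minimal sketch, assuming each task is tagged with its server count, dependency pattern, and risk level; the tier names, token caps, and action caps are hypothetical placeholders, not recommendations from the paper:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    servers: int             # number of MCP servers the workflow touches
    sequential: bool         # does one call depend on another's output?
    failure_sensitive: bool  # is a wrong answer expensive?

@dataclass(frozen=True)
class Route:
    tier: str
    max_tokens: int   # per-query token budget (hypothetical value)
    max_actions: int  # cap on tool calls per query (hypothetical value)

CHEAP = Route("cost-efficient", max_tokens=40_000, max_actions=4)
STRONG = Route("strong-reasoning", max_tokens=150_000, max_actions=12)

def choose_route(task):
    """Escalate when the workflow is sequential, multi-server, or risky."""
    if task.sequential or task.servers > 1 or task.failure_sensitive:
        return STRONG
    return CHEAP

def within_budget(route, tokens_used, actions_taken):
    """Check a run against the caps attached to its route."""
    return tokens_used <= route.max_tokens and actions_taken <= route.max_actions

simple_lookup = Task(servers=1, sequential=False, failure_sensitive=False)
billing_chain = Task(servers=2, sequential=True, failure_sensitive=True)
print(choose_route(simple_lookup).tier)   # cost-efficient
print(choose_route(billing_chain).tier)   # strong-reasoning
print(within_budget(CHEAP, tokens_used=55_000, actions_taken=3))  # False
```

Keeping the caps on the route rather than on the model makes budget overruns a logged event instead of an invoice surprise.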

MCP is infrastructure, not proof of agent reliability

The misconception to kill is that MCP adoption itself makes agents useful. MCP solves part of the integration problem: it gives agents a more standardised way to interact with tools and servers. That is valuable. The old M×N integration mess, where every agent-tool pair needs bespoke wiring, is not a lifestyle anyone should defend.

But MCP-AgentBench shows that once the doors are standardised, agents still have to decide which door to open, what to do inside, when to stop, and how to report the result. MCP reduces integration friction. It does not eliminate planning failure, tool avoidance, omission, hallucination, or token waste.

For business teams, the practical pathway is clear:

| Paper result | Cognaptus inference for business use | Uncertainty boundary |
| --- | --- | --- |
| Six-category benchmark structure | Build internal acceptance tests by workflow type, not by generic “agent success” | Internal tools may be stateful, multimodal, or permission-constrained |
| Outcome-first MCP-Eval | Judge final task success while allowing multiple valid tool paths | High-risk workflows still need human audit and deterministic checks |
| Model × mode variation | Procure and deploy combinations, not model names | Results may shift with prompts, tool schemas, APIs, and model updates |
| Token-performance trade-off | Use routing and escalation instead of one-model absolutism | Benchmark token counts are not full operational cost |
| Error taxonomy | Instrument dashboards around concrete failure modes | Taxonomy may need extension for enterprise-specific risks |

The most important operational move is to build a smaller internal MCP-AgentBench-like suite. Use the same six categories, but replace public MCP servers with your real systems: CRM, ticketing, billing, ERP, logistics, analytics, document repositories, and policy knowledge bases. Then evaluate model × orchestration × budget under realistic permissions and data constraints.

Do not ask, “Can this model use MCP?” Ask:

  • Does it call the tool when the answer requires external data?
  • Does it choose the right server among many?
  • Does it preserve dependencies across sequential calls?
  • Does it synthesize all required fields?
  • Does it hallucinate outside tool evidence?
  • How many tokens and retries does success cost?
  • Which failures are recoverable through routing, and which require product redesign?

That is where a benchmark becomes an operating discipline.
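
Those questions translate into a per-run record and a handful of rates. A minimal scorecard sketch; the `RunRecord` schema and the sample values are hypothetical, and the flags are assumed to come from whatever judge and log checks a team already runs:

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    category: str                  # e.g. "multi-server sequential-call"
    passed: bool                   # outcome judge verdict
    called_tool_when_needed: bool
    missing_fields: int            # required fields absent from the answer
    ungrounded_claims: int         # claims with no supporting tool evidence
    tokens: int
    retries: int

def scorecard(records):
    """Aggregate per-run records into the rates the checklist asks about."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to score")
    successes = sum(r.passed for r in records)
    total_tokens = sum(r.tokens for r in records)
    return {
        "pass_rate": successes / n,
        "tool_issuance_rate": sum(r.called_tool_when_needed for r in records) / n,
        "omission_rate": sum(r.missing_fields > 0 for r in records) / n,
        "hallucination_rate": sum(r.ungrounded_claims > 0 for r in records) / n,
        # All tokens spent, including on failures, divided by successes.
        "tokens_per_success": total_tokens / successes if successes else float("inf"),
        "mean_retries": sum(r.retries for r in records) / n,
    }

# Hypothetical records from one model, one orchestration mode.
records = [
    RunRecord("single-server single-call", True, True, 0, 0, 30_000, 0),
    RunRecord("multi-server parallel-call", True, True, 0, 0, 50_000, 1),
    RunRecord("single-server single-call", False, False, 0, 1, 8_000, 0),
    RunRecord("multi-server sequential-call", False, True, 2, 0, 72_000, 2),
]
card = scorecard(records)
for metric, value in card.items():
    print(f"{metric}: {value}")
```

Grouping the same records by category, model, and orchestration mode gives the internal benchmark its leaderboard, with the failure columns still attached.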

The boundaries are narrow enough to be useful

The paper’s limitations are not fatal, but they matter.

First, the benchmark uses stateless servers. That improves reproducibility, but many enterprise workflows are stateful: ticket updates, order changes, approval chains, customer histories, ongoing negotiations. Rankings may change when agents must maintain and mutate persistent state.

Second, the benchmark is text-based. MCP can support broader interaction patterns, but this evaluation deliberately avoids non-textual modalities because they are harder to judge quantitatively. GUI agents, document-heavy workflows, voice interfaces, and multimodal inspection tasks need separate evaluation.

Third, MCP-Eval is an LLM-as-judge method. Its agreement with human evaluators is strong enough to support scalable benchmarking, but it should not be treated as a legal oracle, compliance officer, or CFO. For high-stakes workflows, use automated judging for screening and trend tracking, then audit samples and edge cases.

Fourth, the tool ecosystem is selected. The authors filtered from 369 candidate servers to 33 operational, stable, stateless, text-based servers. That curation is necessary, but it also means the benchmark measures a disciplined slice of the MCP world, not the entire glorious swamp.

These boundaries make the paper more credible, not less. A benchmark that claims to measure everything usually measures executive impatience.

What MCP-AgentBench really measures

MCP-AgentBench measures whether an agent can turn a standardised tool environment into a successful task outcome. That sounds modest. It is not.

The paper’s evidence says three things businesses should internalise before building MCP-heavy systems.

First, protocol standardisation is necessary but not sufficient. MCP can reduce adapter chaos, but it does not make agents reliable planners.

Second, the orchestration layer is part of model performance. ReAct, native tool calling, and any hybrid controller should be evaluated as first-class design choices.

Third, cost is not an afterthought. A model that performs well by consuming 140k tokens per query may be appropriate for difficult, high-value tasks. It is not automatically the default engine for every workflow unless your budget enjoys interpretive dance.

The mature takeaway is not “open-source wins”, “Claude wins”, or “tool calling wins”. The mature takeaway is that agent deployment has moved from model selection to operating-mode selection. MCP-AgentBench gives teams a vocabulary for that shift: servers, tools, task categories, pass rates, judge agreement, failure modes, and token cost.

That is what makes the paper useful. It does not promise protocol peace. It shows where the tool wars actually moved.

Cognaptus: Automate the Present, Incubate the Future.


  1. Zikang Guo, Benfeng Xu, Chiwei Zhu, Wentao Hong, Xiaorui Wang, and Zhendong Mao, “MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools,” arXiv:2509.09734, 2025.