Every engineering team has met this problem: the useful data exists, but it lives in thirteen different shapes, three different tool conventions, two incompatible logs, and one heroic spreadsheet that nobody dares to open.

AI agents have the same disease, only with more acronyms.

The paper behind the Agent Data Protocol, or ADP, argues that large-scale supervised fine-tuning of AI agents has been held back less by a lack of data than by a lack of shared representation.[1] Agent datasets already exist for coding, software engineering, web browsing, API use, operating-system interaction, and general tool use. The difficulty is that each one tends to encode actions, observations, tool calls, web states, messages, and execution feedback in its own local dialect. Naturally, every dataset is special. How convenient for nobody.

ADP’s contribution is to treat agent learning as an interoperability problem before treating it as a modelling problem. It introduces a lightweight Pydantic-based schema that represents heterogeneous agent trajectories as typed sequences of actions and observations. Once those trajectories are standardised, they can be exported into multiple agent harnesses for supervised fine-tuning.

That sounds almost boring. It is not. In infrastructure, boring is usually where the compounding returns hide.

ADP is not a new agent; it is a translation layer for agent behaviour

The easy misconception is to read ADP as another agent model, benchmark, or heroic “agent framework” arriving to save civilisation from YAML. It is none of those. ADP is closer to an interlingua: a shared intermediate representation that sits between raw agent datasets and downstream training formats.

The paper’s core abstraction is a Trajectory. A trajectory contains an ordered interaction history between an agent and its environment. Inside that history, ADP uses two major families of objects:

| ADP object family | Main types | What it captures |
|---|---|---|
| Action | API action, code action, message action | What the agent decides or emits |
| Observation | Text observation, web observation | What the user, tool, browser, or environment returns |

This is the mechanism. A messy raw trace from a coding dataset may contain a user prompt, an assistant message with a Python code block, an execution result, and a final answer. ADP splits that into a user text observation, a code action, an environment text observation, and a message action. A web-browsing trace can similarly become API actions such as navigation or clicking, plus web observations containing HTML, accessibility-tree data, URL state, viewport information, and optionally screenshot data.
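As a rough illustration of that decomposition, here is a minimal sketch that uses standard-library dataclasses as a stand-in for the Pydantic models the paper describes. The class and field names (`TextObservation`, `CodeAction`, `source`, and so on) and the example content are assumptions made for this sketch, not ADP's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Union


# Hypothetical type names; ADP's real schema is Pydantic-based and may differ.
@dataclass
class TextObservation:
    source: str   # who produced the text: "user" or "environment"
    content: str


@dataclass
class CodeAction:
    language: str
    content: str


@dataclass
class MessageAction:
    content: str


Step = Union[TextObservation, CodeAction, MessageAction]


@dataclass
class Trajectory:
    id: str
    content: List[Step] = field(default_factory=list)


# The coding trace from the paragraph above, split into typed steps.
trajectory = Trajectory(
    id="example-0",
    content=[
        TextObservation(source="user", content="What is 2 ** 10?"),
        CodeAction(language="python", content="print(2 ** 10)"),
        TextObservation(source="environment", content="1024"),
        MessageAction(content="2 ** 10 is 1024."),
    ],
)

step_kinds = [type(step).__name__ for step in trajectory.content]
print(step_kinds)
```

A web trace would follow the same pattern, with API-style actions and a richer observation type in place of the code and text steps.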

The point is not aesthetic tidiness. The point is that many forms of agent behaviour can be decomposed into the same basic loop:

agent acts → environment responds → agent acts again.

Once the loop is typed, it can be analysed, validated, mixed, sampled, and re-exported. That is the quiet move from “dataset collection” to “data infrastructure”.

The hub-and-spoke mechanism is the actual business idea

The paper’s most business-relevant figure is not a leaderboard. It is the conversion diagram.

Without ADP, each dataset needs a custom conversion into each agent harness. If there are $D$ datasets and $A$ agent formats, the conversion burden is roughly:

$$ O(D \times A) $$

With ADP, each dataset is converted once into ADP, and each harness gets one converter out of ADP:

$$ O(D + A) $$

This is the hub-and-spoke argument. ADP becomes the hub. Raw datasets and agent frameworks become spokes.
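A minimal sketch of the hub-and-spoke idea, with hypothetical dataset and harness names and deliberately trivial converters. Trajectories are simplified to lists of dicts here; the only point is that each dataset gets one function into the hub format and each harness gets one function out of it.

```python
from typing import Callable, Dict, List

Step = Dict[str, str]          # simplified hub format: {"kind": ..., "content": ...}
Trajectory = List[Step]


# One converter per dataset INTO the hub format (dataset names are hypothetical).
def from_chat_log(raw: List[Dict[str, str]]) -> Trajectory:
    kinds = {"user": "text_observation", "assistant": "message_action"}
    return [{"kind": kinds[m["role"]], "content": m["text"]} for m in raw]


def from_shell_log(raw: List[str]) -> Trajectory:
    # Lines starting with "$ " are commands; everything else is output.
    return [
        {"kind": "code_action", "content": line[2:]} if line.startswith("$ ")
        else {"kind": "text_observation", "content": line}
        for line in raw
    ]


# One converter per harness OUT of the hub format (harness names are hypothetical).
def to_harness_a(traj: Trajectory) -> str:
    return "\n".join(f"[{s['kind']}] {s['content']}" for s in traj)


def to_harness_b(traj: Trajectory) -> List[Dict[str, str]]:
    def role(kind: str) -> str:
        return "assistant" if kind.endswith("_action") else "user"
    return [{"role": role(s["kind"]), "content": s["content"]} for s in traj]


def to_harness_c(traj: Trajectory) -> List[str]:
    # Keeps only what the agent did, dropping observations.
    return [s["content"] for s in traj if s["kind"].endswith("_action")]


importers: Dict[str, Callable] = {"chat": from_chat_log, "shell": from_shell_log}
exporters: Dict[str, Callable] = {"a": to_harness_a, "b": to_harness_b, "c": to_harness_c}

# D + A converters are written, yet D x A dataset-to-harness routes are covered.
converters_written = len(importers) + len(exporters)
routes_covered = len(importers) * len(exporters)

shell_traj = importers["shell"](["$ ls", "README.md"])
exported = exporters["b"](shell_traj)
print(converters_written, routes_covered, exported)
```

With two datasets and three harnesses the saving is small (five converters for six routes), but the gap widens quickly as either side grows.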

That difference matters because agent systems are not stable objects. New datasets appear. New harnesses appear. Tool interfaces mutate. Browsers, shells, API wrappers, and code-execution environments all insist on being slightly different from one another, because apparently society needed that too.

The paper quantifies the engineering proxy using lines of converter code. The authors report about 4,892 lines of code to convert 13 datasets into ADP. They then report ADP-to-SFT converters of about 150 lines for OpenHands CodeActAgent, 50 for SWE-Agent, and 30 for AgentLab, or about 77 lines on average. Their illustrative calculation is simple: for 100 harnesses, direct dataset-to-harness conversion would be roughly 489,200 lines, while the ADP route would be roughly 12,592 lines.
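The arithmetic behind those figures can be reproduced in a few lines. This sketch uses only the numbers quoted above; the variable names are made up, and the direct-route estimate assumes, as the illustrative calculation implies, that each harness would repeat the full dataset-conversion effort.

```python
dataset_to_adp_loc = 4892            # reported total for converting 13 datasets into ADP
harness_loc = {"OpenHands CodeActAgent": 150, "SWE-Agent": 50, "AgentLab": 30}

# 230 / 3 is about 76.7, which rounds to the quoted 77.
avg_harness_loc = round(sum(harness_loc.values()) / len(harness_loc))

num_harnesses = 100
# Direct route: every harness repeats the dataset-conversion work.
direct_loc = dataset_to_adp_loc * num_harnesses
# Hub route: convert datasets once, then add one small exporter per harness.
adp_loc = dataset_to_adp_loc + avg_harness_loc * num_harnesses

print(avg_harness_loc, direct_loc, adp_loc)
```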

Lines of code are an imperfect measure of cost. Some 50-line converters contain more pain than an entire dashboard. Still, the comparison usefully captures the shape of the problem: ADP changes conversion from repeated bespoke integration into reusable adapter work.

For enterprises, that is the practical pathway. The value is not “use ADP and your agent gets smarter by magic”. The value is that standardised trajectories make it easier to reuse internal workflow traces across multiple agent products, training pipelines, and evaluation setups without rebuilding the data plumbing every time.

The dataset contribution is large, but the sampling details matter

The authors convert 13 existing agent datasets into ADP and release what they call ADP Dataset V1, containing 1.3 million training trajectories. The sources span several categories: coding, software engineering, API/tool use, and web browsing. They include datasets such as AgentInstruct, Code-Feedback, CodeActInstruct, Go-Browse, Mind2Web, Nebius SWE-Agent trajectories, NNetNav, OpenHands feedback, Orca AgentInstruct, SWE-Gym, SWE-smith, and Synatra.

The cross-dataset analysis is not the headline result, but it is useful. It shows that the standardised representation is expressive enough to expose differences across datasets. Average trajectory length ranges from 1.0 round in Synatra to 26.8 rounds in SWE-smith, with an overall average of 10.1. Action mix also differs sharply: web datasets lean heavily toward API actions, coding datasets toward code actions, and software-engineering datasets mix API and code actions.

This is where ADP becomes more than a file format. Once trajectories are normalised, dataset comparison becomes possible. You can ask whether one source is mostly web navigation, another mostly code execution, another mostly message-heavy tool instruction. Before standardisation, those comparisons are trapped inside custom parsers and dataset folklore.
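To make that concrete, here is a small sketch of the kind of comparison a normalised corpus allows. The dataset names and trajectories are invented, each step is reduced to a kind label, and counting actions per trajectory as "rounds" is an assumption of this example rather than the paper's definition.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical, already-normalised trajectories: each step is tagged with a kind.
corpus: Dict[str, List[List[str]]] = {
    "web_dataset": [
        ["text_observation", "api_action", "web_observation", "api_action"],
    ],
    "coding_dataset": [
        ["text_observation", "code_action", "text_observation", "message_action"],
        ["text_observation", "code_action", "text_observation", "code_action"],
    ],
}


def action_mix(trajectories: List[List[str]]) -> Dict[str, float]:
    """Share of each action kind among all actions in a dataset."""
    counts = Counter(
        kind for traj in trajectories for kind in traj if kind.endswith("_action")
    )
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}


def avg_rounds(trajectories: List[List[str]]) -> float:
    """Average number of actions per trajectory, used here as a proxy for rounds."""
    actions = sum(kind.endswith("_action") for traj in trajectories for kind in traj)
    return actions / len(trajectories)


for name, trajs in corpus.items():
    print(name, action_mix(trajs), avg_rounds(trajs))
```

The same two functions run unchanged over every source, which is exactly what custom per-dataset parsers make awkward.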

One detail deserves attention: the experiments do not simply throw all 1.3 million trajectories into every model. Appendix C describes balanced sampling and domain-specific filtering. Large sources such as Orca AgentInstruct and Synatra are heavily downsampled; under-represented sources such as SWE-Gym are upsampled. For OpenHands and SWE-Agent, the authors use the non-web portion of the ADP corpus, around 30,000 training samples. For AgentLab, they use the web portion, around 20,000 samples.
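A hedged sketch of what multiplier-based balancing might look like. The source names, sizes, and multipliers below are invented for illustration, and the resampling rule is one simple option, not the procedure from Appendix C.

```python
import random
from typing import Dict, List


def resample(items: List[str], multiplier: float, rng: random.Random) -> List[str]:
    """Downsample when multiplier < 1, upsample by repetition when multiplier > 1."""
    target = round(len(items) * multiplier)
    if target <= len(items):
        return rng.sample(items, target)
    full_copies, remainder = divmod(target, len(items))
    return items * full_copies + rng.sample(items, remainder)


# Hypothetical sources, sizes, and multipliers.
sources: Dict[str, List[str]] = {
    "large_source": [f"large-{i}" for i in range(1000)],
    "small_source": [f"small-{i}" for i in range(50)],
}
multipliers = {"large_source": 0.1, "small_source": 3.0}

rng = random.Random(0)  # fixed seed so the mixture is reproducible
mixture: List[str] = []
for name, items in sources.items():
    mixture.extend(resample(items, multipliers[name], rng))
rng.shuffle(mixture)

# Count how many samples each source contributes to the final mixture.
sizes = {
    name: sum(x.startswith(name.split("_")[0]) for x in mixture) for name in sources
}
print(sizes)
```

Because every source already shares one format, changing the mixture is a matter of editing the multipliers rather than rewriting parsers.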

That matters. The paper is not proving that indiscriminate mixture always works. It is showing that a standard protocol makes careful mixture construction easier.

The main evidence: ADP-trained agents improve across harnesses and tasks

The experimental design is straightforward. The authors fine-tune Qwen2.5-Coder-Instruct models at 7B, 14B, and 32B scale using ADP-converted data, then evaluate across different agent harnesses and benchmarks. The harnesses include OpenHands, SWE-Agent, and AgentLab. The benchmarks include SWE-Bench Verified for software engineering, WebArena for web browsing, AgentBench OS for operating-system-style tool use, and GAIA for general assistant tasks.

The headline results are substantial. On SWE-Bench Verified, the 7B Qwen2.5-Coder model with SWE-Agent rises from 0.4% to 20.2% after ADP training. With OpenHands, the 7B model rises from 2.8% to 20.4%. At 14B, the SWE-Agent setup reaches 34.4%, and OpenHands reaches 30.6%. At 32B, SWE-Agent reaches 40.3%, while OpenHands reaches 36.8%.

On WebArena with AgentLab, ADP-trained Qwen2.5-Coder improves from 4.5% to 21.0% at 7B, from 5.5% to 22.2% at 14B, and from 10.9% to 22.9% at 32B. On AgentBench OS with OpenHands, the 7B model improves from 3.5% to 27.1%, the 14B model from 2.8% to 20.8%, and the 32B model from 27.8% to 34.7%. GAIA shows a smaller gain: Qwen2.5-7B-Instruct with OpenHands rises from 7.3% to 9.1%.

A compact reading of the evidence looks like this:

| Evidence item | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Tables 3–5: 7B, 14B, 32B benchmark results | Main evidence | ADP-trained models improve over corresponding base models across selected harnesses and benchmarks | Universal gains across all models, all tasks, or all enterprise workflows |
| Table 6: mixed ADP vs task-specific tuning | Comparison with task-specific training | Diverse ADP mixtures can outperform single-domain tuning under matched harness/model settings | That every mixed dataset is better than every specialised dataset |
| Figures 3–4: scaling plots | Evidence visualisation / scaling summary | ADP-trained models outperform base models at each tested size; gains are visible across scales | A full scaling law or saturation forecast |
| Tables 7–8: converter LOC | Implementation evidence | ADP reduces repeated conversion work through hub-and-spoke adapters | Total integration cost in production environments |
| Appendix C: sampling and filtering | Implementation detail | The training corpus is balanced and domain-filtered rather than naively pooled | An exhaustive optimisation of data mixture weights |
| Appendix E.1: equal-scale comparison | Robustness / sensitivity test | ADP’s advantage is not only due to more samples in one SWE-Bench comparison | A complete ablation over all datasets and all mixture designs |

The important interpretation is not just “ADP improves scores”. The mechanism says: standardisation enables broader, cleaner, reusable training mixtures; those mixtures teach agents more transferable behaviours than narrow task-specific data in several tested settings.

Cross-task transfer is the more interesting result than another leaderboard bump

The most revealing comparison is not base model versus ADP-trained model. That comparison is useful, but unsurprising: fine-tuning on relevant agent trajectories should help agents act more like agents. Stunning, yes. Alert the committee.

The sharper question is whether a mixed ADP corpus beats task-specific fine-tuning. Table 6 addresses this directly.

On SWE-Bench with OpenHands, Qwen2.5-7B-Instruct trained only on SWE-smith reaches 1.0%, while ADP training reaches 10.4%. For Qwen-3-8B, CodeActInstruct plus Code-Feedback reaches 0.2%, SWE-smith alone reaches 11.0%, and ADP reaches 16.6%. On WebArena, Go-Browse-only tuning reaches 16.0%, while ADP reaches 20.1%. On AgentBench OS, AgentInstruct-only tuning reaches 21.5%, while ADP reaches 25.7%. On GAIA, AgentInstruct-only training reaches 0.6%, while ADP reaches 9.1%.

The authors also run an equal-scale check in Appendix E.1. They upsample SWE-smith to match roughly 30,000 training examples and compare it with ADP on SWE-Bench using OpenHands and Qwen-3-8B. SWE-smith reaches 11.0%; ADP reaches 16.6%.

This matters because it weakens the most obvious alternative explanation: perhaps ADP wins only because it uses more data. In that test, the data scale is comparable, yet the mixed ADP corpus still performs better. The test does not identify the optimal mixture, but it supports the claim that diversity and unified structure contribute, not just volume.

For business readers, this is the strategic lesson. A narrow dataset may teach an agent to imitate one workflow. A well-curated mixed trajectory corpus may teach reusable action patterns: inspect, call, execute, observe, revise, finish. That is closer to operational competence.

The real enterprise analogue is not public benchmarks; it is workflow trace reuse

Cognaptus inference: ADP points toward a practical enterprise architecture for agent training and improvement.

Many organisations already generate agent-like data without calling it that. Support teams escalate tickets through tools. Analysts query databases, open dashboards, write notes, and revise outputs. Developers inspect logs, run commands, patch files, and test results. Operations teams move between forms, documents, APIs, approvals, and exception handling.

The problem is that those traces are usually locked inside application logs, chat transcripts, ticket histories, browser recordings, database audit trails, and workflow platforms. They are not immediately useful for training agents because they do not share a behavioural grammar.

ADP suggests a route:

  1. Map internal workflow traces into typed actions and observations.
  2. Preserve provenance, tool metadata, environment feedback, and human corrections.
  3. Validate structure before training.
  4. Build domain-filtered training mixtures for specific agent harnesses.
  5. Export the same standardised corpus into multiple downstream agents.
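The first three steps above can be sketched roughly as follows. Everything here is hypothetical: the log fields, the event-to-kind mapping, and the validation rules are invented for illustration and are not part of ADP.

```python
from typing import Dict, List

# Hypothetical internal log records; field names are invented for illustration.
ticket_log = [
    {"actor": "customer", "event": "message", "body": "Refund not received", "ts": "09:00"},
    {"actor": "agent", "event": "tool_call", "body": "lookup_order(id=123)", "ts": "09:01"},
    {"actor": "system", "event": "tool_result", "body": "status=refunded", "ts": "09:01"},
    {"actor": "agent", "event": "message", "body": "Your refund was issued.", "ts": "09:02"},
]

# Assumed mapping from (actor, event) pairs to typed step kinds.
KIND_BY_EVENT = {
    ("agent", "tool_call"): "api_action",
    ("agent", "message"): "message_action",
    ("customer", "message"): "text_observation",
    ("system", "tool_result"): "text_observation",
}


def to_steps(records: List[Dict[str, str]], source: str) -> List[Dict[str, str]]:
    """Steps 1 and 2: map records to typed steps and keep provenance."""
    return [
        {
            "kind": KIND_BY_EVENT[(r["actor"], r["event"])],
            "content": r["body"],
            "provenance": f"{source}@{r['ts']}",
        }
        for r in records
    ]


def validate(steps: List[Dict[str, str]]) -> List[str]:
    """Step 3: return a list of structural problems (an empty list means valid)."""
    if not steps:
        return ["trajectory is empty"]
    problems = []
    if not steps[0]["kind"].endswith("_observation"):
        problems.append("first step should be an observation")
    if not steps[-1]["kind"].endswith("_action"):
        problems.append("last step should be an action")
    problems += [
        f"step {i} has empty content" for i, s in enumerate(steps) if not s["content"]
    ]
    return problems


steps = to_steps(ticket_log, source="ticketing")
problems = validate(steps)
print([s["kind"] for s in steps], problems)
```

An unmapped `(actor, event)` pair raises a `KeyError` here, which is a deliberate choice for a sketch: unknown events should fail loudly before training rather than slip into the corpus.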

The return is not only model accuracy. It is operational reusability. A bank’s compliance-review traces, a logistics company’s exception-resolution traces, or a software company’s incident-response traces could become reusable agent training assets if represented consistently.

That is the business relevance pathway, with an important boundary: the paper shows this on public datasets, Qwen-family models, three open agent harnesses, and selected research benchmarks. Enterprise adoption adds harder problems: proprietary tool schemas, privacy, licences, redaction, data quality, role permissions, auditability, and domain-specific failure costs. The protocol lowers integration friction. It does not remove governance. Unfortunately, governance remains stubbornly employed.

The limitation is not that ADP is simple; the limitation is what simplicity cannot capture yet

ADP’s simplicity is a feature, but it also defines its boundary.

First, the schema is designed around actions and observations that can be represented as API calls, code blocks, messages, text feedback, and web states. That covers a broad range of current agent work, but the authors themselves point to multimodality as future work. Rich screen recordings, images, audio, sensor streams, and complex UI-state histories may need extensions.

Second, the results depend on the quality and licence status of source datasets. The paper includes a licensing appendix and warns users to verify current terms. This is not a decorative appendix. Any business thinking about trajectory reuse must treat licensing and data provenance as part of the infrastructure, not a legal chore at the end when everyone is tired and optimistic.

Third, the paper’s experiments are strong but not exhaustive. They do not establish a universal best mixture strategy. Appendix C explicitly leaves the study of sampling multipliers and per-dataset effects to future work. ADP makes mixture experimentation easier; it does not solve mixture optimisation automatically.

Fourth, benchmark performance is not the same as deployment reliability. SWE-Bench, WebArena, AgentBench OS, and GAIA are meaningful research tests, but enterprise workflows include hidden constraints: latency, permissions, UI drift, incomplete logs, exception handling, compliance review, and the charming habit internal systems have of failing silently at 4:58 p.m.

What this paper changes is the unit of competition

The usual agent story focuses on smarter models, better tools, larger contexts, and more elaborate planning loops. ADP shifts attention to a less glamorous unit: the trajectory format.

That is a useful correction. Agent performance is not only a model problem. It is a data supply-chain problem. If action histories cannot be combined, compared, sampled, validated, and exported, then every agent project becomes a boutique integration exercise. Boutique is lovely for hotels. Less so for training infrastructure.

ADP’s strongest claim is not that it is the final universal standard for all agent data. It is that agent training becomes more scalable when the ecosystem has a common behavioural grammar. The experimental results make that claim credible: large gains over base models, stronger mixed-data performance than several task-specific baselines, and a practical reduction in conversion effort.

For companies building agent systems, the lesson is direct. Do not start by asking only which model to fine-tune. Ask whether your organisation can represent its agent workflows in a reusable format. Ask whether tool calls, code execution, browser states, user feedback, and final responses can be captured as typed trajectories. Ask whether today’s logs can become tomorrow’s training data without hiring a new team of parser archaeologists.

The Esperanto metaphor is apt, with one caveat. Esperanto did not unify humanity. ADP will not unify the agent ecosystem by itself either. Standards win when they become useful enough that everyone is too lazy not to use them.

That, in technology, is often how civilisation advances.

Cognaptus: Automate the Present, Incubate the Future.


  1. Yueqi Song et al., “Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents,” arXiv:2510.24702v2, published as an ICLR 2026 conference paper, https://arxiv.org/html/2510.24702