TL;DR for operators

Most enterprise data work is not blocked by a lack of models. It is blocked by orchestration.

A company may already have Spark, Pandas, SQL engines, notebooks, dashboards, semantic layers, data lakes, vector stores, ETL jobs, monitoring tools, and a growing pile of LLM wrappers. The awkward part is deciding which tool should act, in what order, on which data, under which assumptions, and how to recover when the first plan fails. This is the gap the Data Agent paper tries to formalise.[1]

The paper’s useful idea is not “let an LLM run the data department”. That would be a nice way to produce expensive nonsense with a confident accent. The useful idea is a layered architecture in which agents perceive tasks and data, select tools and specialist agents, build pipelines, execute them, store intermediate results, and revise the plan when execution breaks.

The most concrete part is iDataScience, a proposed multi-agent system for data science tasks. It has an offline stage that discovers and benchmarks data skills, then an online stage that embeds a new task, selects agents by similarity-weighted benchmark results, decomposes the task into sub-tasks, and executes the resulting pipeline with dynamic refinement.

For operators, the near-term value is routing and coordination: fewer brittle handoffs, better tool selection, richer metadata reuse, and more systematic diagnosis. The boundary is equally clear. This paper is mainly an architecture and systems agenda. It does not provide broad production evidence that agents can replace data engineers, analysts, or DBAs end-to-end. The correct business reading is: Data Agents are a possible control plane for enterprise data work, not a magic intern who can be left alone with production credentials and a Red Bull.

The real bottleneck is not extraction, transformation, or loading

The familiar enterprise data pipeline has always had a slightly theatrical simplicity: extract the data, transform it, load it somewhere, and pretend the hard part is over.

In reality, the hard part comes afterwards. A business user asks a question whose meaning depends on context. The relevant data sits across structured tables, unstructured documents, logs, dashboards, and possibly a lake that is less a lake than a swamp with invoices. A data scientist has one tool for modelling, an analyst has another tool for reporting, a DBA has monitoring scripts, and a platform team has its own orchestration layer. The work proceeds through handoffs, judgement calls, ad hoc scripts, and long Slack threads that should probably be classified as operational debt.

The Data Agent paper argues that traditional Data+AI systems remain too dependent on human experts for exactly this reason. Existing systems can optimise pieces of the pipeline: index selection, cardinality estimation, data cleaning, feature management, model inference, query rewriting, and so on. But the adaptation problem remains. When data, tasks, tools, and environments change, someone still has to reason about the pipeline.

That is where the paper positions the Data Agent: not as another model, but as an orchestration architecture for data-related tasks. The agent’s job is to understand tasks, understand data, understand tools, plan workflows, execute them, and learn from failures.

That framing matters. If the reader treats this as “LLMs for ETL”, the paper sounds like a prettier automation wrapper. If the reader treats it as a proposed control plane for Data+AI ecosystems, the architecture becomes more interesting.

A Data Agent is a control stack, not a chatbot with database access

The paper defines a Data Agent around six key capabilities: perception, reasoning and planning, tool invocation, memory, continuous learning, and multi-agent collaboration.

These are not decorative modules. Each one corresponds to a failure mode in real data work.

| Data Agent capability | Operational problem it addresses | Business meaning | Boundary |
|---|---|---|---|
| Perception | The system must understand the task, data, tools, agents, and environment | Better interpretation of messy natural-language requests and heterogeneous data contexts | Perception can still misread ambiguous requirements or poor metadata |
| Reasoning and planning | Complex data tasks require multi-step workflows | Less manual pipeline design and fewer hardcoded chains | Planning quality depends on task decomposition and available tool profiles |
| Tool invocation | Different tasks require different engines and operators | Existing tools can be reused instead of replaced | Bad tool selection can make the agent slower, wrong, or both |
| Memory | Useful context exists across user history, domain knowledge, and intermediate results | More continuity across repeated workflows | Memory introduces governance, privacy, and freshness problems |
| Continuous learning | Agents need to improve after failures | Potential reduction in repeated manual debugging | Reward design and self-reflection remain open research issues |
| Multi-agent coordination | No single agent is best at every data task | Specialist agents can be composed for complex workflows | Coordination adds failure surfaces and monitoring complexity |

The mechanism is therefore closer to an operating layer than an interface layer. A chatbot answers a question. A Data Agent is supposed to decide how a question should become a pipeline.

That distinction is where the paper earns its keep. The authors are not merely saying, “LLMs can understand data.” They are proposing a system in which semantic understanding is connected to data catalogs, engines, benchmarks, agent profiles, execution plans, and feedback loops.

In enterprise terms, this is the difference between a friendly assistant that writes SQL and an orchestration layer that knows when SQL is the wrong tool.

The architecture starts with semantic organisation, because agents cannot orchestrate what they cannot find

The first layer of the proposed stack is data understanding and exploration. This includes semantic catalogs, metadata indexes, semantic organisation, data fabric, and strategies for using preparation, cleaning, and integration tools.

This is less glamorous than multi-agent planning, but it is more important. An agent cannot choose the right pipeline if it has no reliable map of the data estate. Without semantic organisation, an LLM is effectively being asked to navigate a warehouse in the dark while holding a very expensive flashlight.

The paper’s architecture assumes that data must be made discoverable and interpretable before orchestration becomes feasible. That means schemas, metadata, semantic indexes, and links across heterogeneous data sources matter. The agent layer is not a replacement for data governance. It is a new consumer of it.

The business implication is blunt: companies with poor data catalogs should not expect Data Agents to rescue them. Agents amplify the quality of the environment they operate in. If the environment is undocumented, duplicated, stale, and politically guarded by six teams, the agent will inherit the mess. Possibly at machine speed. Progress.

Engine and tool profiling turns orchestration into a selection problem

After the data layer comes the engine layer. The paper names familiar execution environments: Spark, DBMSs, Pandas, PyData, and other data processing tools. The challenge is not that these tools do not exist. The challenge is that they have different strengths, costs, interfaces, and failure modes.

A Data Agent therefore needs to profile tools and agents. It must know not only what a tool is called, but what it is good for, what inputs it expects, what outputs it produces, and when it is likely to fail.
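A tool profile of this kind can be sketched as a small record. This is a minimal illustration, not the paper's schema: the `ToolProfile` fields, the capability names, and the two example entries are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolProfile:
    """Hypothetical profile record; field names are illustrative, not from the paper."""
    name: str
    capabilities: frozenset      # what the tool is good for
    input_formats: frozenset     # what inputs it expects
    output_formats: frozenset    # what outputs it produces
    known_limits: tuple = ()     # when it is likely to fail


def candidates(profiles, capability, input_format):
    """Return the names of tools that claim the capability and accept the input format."""
    return [
        p.name
        for p in profiles
        if capability in p.capabilities and input_format in p.input_formats
    ]


# Illustrative entries only; the limits are placeholders, not measured facts.
PROFILES = [
    ToolProfile(
        name="pandas",
        capabilities=frozenset({"filter", "join", "feature_derivation"}),
        input_formats=frozenset({"csv", "parquet"}),
        output_formats=frozenset({"dataframe"}),
        known_limits=("data must fit in memory",),
    ),
    ToolProfile(
        name="spark",
        capabilities=frozenset({"filter", "join", "aggregate"}),
        input_formats=frozenset({"parquet"}),
        output_formats=frozenset({"dataframe"}),
        known_limits=("cluster start-up latency",),
    ),
]

csv_join_tools = candidates(PROFILES, "join", "csv")
parquet_join_tools = candidates(PROFILES, "join", "parquet")
```

The point of the record is that selection becomes a query over declared properties rather than a hardcoded branch per tool.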

This is where the paper’s reference to tool invocation protocols matters. Standardised interfaces such as MCP and agent-to-agent protocols are treated as plumbing for exchanging information and state between agents and tools. The plumbing is not the strategy, but without plumbing, strategy becomes a pile of wrappers.

The core mechanism is simple:

  1. Understand the user task.
  2. Map the task to required data capabilities.
  3. Identify tools or agents that can perform those capabilities.
  4. Generate a pipeline.
  5. Execute, inspect, and revise.
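The five steps can be sketched as a toy loop. This is an assumption-heavy illustration rather than the paper's algorithm: `run_task`, the `registry` layout, and the two parsers are hypothetical, and steps 1 and 2 are treated as already done by whatever maps a request to an ordered list of capabilities.

```python
def run_task(task, required_capabilities, registry):
    """Toy version of the five-step loop. All names here are hypothetical.

    registry maps a capability name to a list of callables, best candidate first.
    Each callable takes the current value and returns a new one, or raises.
    """
    value = task
    for capability in required_capabilities:
        # Step 3: identify tools that can perform the capability.
        tools = registry.get(capability, [])
        if not tools:
            raise LookupError(f"no tool registered for {capability!r}")
        # Steps 4-5: run the planned tool, inspect the outcome, and revise
        # by falling back to the next candidate when execution fails.
        last_error = None
        for tool in tools:
            try:
                value = tool(value)
                break
            except Exception as exc:  # a real system would narrow this
                last_error = exc
        else:
            raise RuntimeError(f"all tools failed for {capability!r}") from last_error
    return value


def strict_parse(text):
    return [int(x) for x in text.split(",")]          # fails on empty fields


def lenient_parse(text):
    return [int(x) for x in text.split(",") if x.strip()]


registry = {"parse": [strict_parse, lenient_parse], "sum": [sum]}
result = run_task("1,2,,3", ["parse", "sum"], registry)  # strict parser fails, lenient one recovers
```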

This turns enterprise data automation from “write another workflow” into “select and coordinate capabilities”. That is a meaningful shift. It is also where the paper moves beyond traditional ETL: the pipeline is not fixed before execution. It is planned, tested, and adjusted.

iDataScience makes the architecture concrete

The paper’s most developed example is iDataScience, a multi-agent system for data science tasks. It is useful because it translates the broad architecture into a two-stage mechanism: offline benchmarking and online orchestration.

The offline stage builds knowledge about data skills and agent capabilities. The online stage uses that knowledge to solve a new task.

| iDataScience stage | Mechanism | Likely purpose in the paper | What it supports | What it does not prove |
|---|---|---|---|---|
| Offline skill discovery | Extract data skills from a corpus of data science examples using LLMs | Main architectural mechanism | A way to represent data science work as composable capabilities | That the extracted skill taxonomy is universally correct |
| Skill hierarchy construction | Cluster semantically similar skills into a recursive hierarchy | Implementation detail | Better benchmark coverage and scenario customisation | That clustering will be stable across domains |
| Skill-based benchmark generation | Sample important skills and generate test cases with evaluation functions | Main design contribution | More adaptive benchmarking than fixed task lists | That LLM-generated benchmarks are always realistic |
| Task embedding | Align task embeddings with solution embeddings through contrastive learning | Implementation detail for agent selection | Similar tasks can be matched by required reasoning procedure | That embeddings fully capture task complexity |
| Adaptive agent selection | Weight benchmark cases by similarity to the online task | Main orchestration mechanism | More relevant agent selection than raw average benchmark scores | That the selected agent will always succeed |
| Pipeline execution and refinement | Execute sub-tasks in dependency order, revise failures locally or globally | Robustness mechanism | A recovery path when intermediate execution fails | Production-grade reliability under high-stakes conditions |

The clever part is not that iDataScience uses agents. Everyone now uses agents, apparently including people who once used three prompts and a spreadsheet macro. The clever part is that it tries to make agent selection conditional on the task.

A generic benchmark score is a crude instrument. An agent that performs well on one class of data science task may be mediocre on another. The paper therefore proposes similarity-weighted benchmark aggregation: compare the online task with benchmark test cases, weight the relevant cases more heavily, and select the agent whose past performance best matches the current requirement.
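A small sketch shows why the weighting matters. The scheme below (cosine similarity clipped at zero, then a weighted mean) is an assumption chosen for illustration, and the agents, cases, and scores are invented. The two agents have identical raw averages, yet the weighted score separates them once the task is known.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_agent(task_vec, cases, scores):
    """Pick the agent with the best similarity-weighted benchmark score.

    cases:  {case_id: embedding vector}
    scores: {agent_name: {case_id: score in [0, 1]}}
    The weighting scheme (clipped cosine similarity) is an assumption.
    """
    weights = {cid: max(cosine(task_vec, vec), 0.0) for cid, vec in cases.items()}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no benchmark case resembles the task")
    weighted = {
        agent: sum(weights.get(cid, 0.0) * s for cid, s in per_case.items()) / total
        for agent, per_case in scores.items()
    }
    return max(weighted, key=weighted.get), weighted


# Invented data: both agents average 0.65 across the two benchmark cases.
CASES = {"cleaning": [1.0, 0.0], "modelling": [0.0, 1.0]}
SCORES = {
    "wrangler": {"cleaning": 0.9, "modelling": 0.4},
    "modeller": {"cleaning": 0.5, "modelling": 0.8},
}

best_for_modelling, _ = select_agent([0.0, 1.0], CASES, SCORES)
best_for_cleaning, _ = select_agent([1.0, 0.0], CASES, SCORES)
```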

That is the right instinct. Enterprise orchestration should not ask, “Which agent is best?” It should ask, “Which agent is best for this task, with this data, under this execution constraint?”

Benchmarking becomes a skill map, not a leaderboard

The paper criticises existing data science benchmarks for being constrained to predefined task types. Its alternative is a skill-based benchmark.

The workflow is roughly this: collect data science examples from sources such as online forums, competitions, and existing benchmarks; use LLMs to summarise solutions; filter low-quality examples; extract associated data skills; cluster those skills into a hierarchy; assign importance weights; then generate benchmark test cases by composing selected skills.

This is a meaningful departure from leaderboard thinking. A leaderboard ranks systems against a fixed test set. A skill map tries to describe the capability requirements behind tasks. For agent orchestration, that distinction matters. The orchestrator does not merely need to know which agent scored higher last month. It needs to know which agent can handle missing-value treatment, feature derivation, filtering, regression, image-table alignment, or whatever other combination the current job requires.

The paper’s benchmark construction also includes executable evaluation functions that produce scalar scores. This is important because agent outputs are otherwise difficult to compare consistently. Where evaluation standards are too ambiguous for direct code, the paper allows LLMs inside the evaluation function under generated rules.
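An executable evaluation function can be as simple as a callable attached to each test case. The sketch below assumes a hypothetical `BenchmarkCase` record and an exact-match scorer for a missing-value task; the paper's generated evaluators are likely richer, and the LLM-judged variant is not shown.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkCase:
    """Hypothetical test-case record: a prompt, the skills it exercises, and a scorer."""
    prompt: str
    skills: tuple
    evaluate: Callable[[object], float]   # maps an agent's output to a score in [0, 1]


def fraction_matching(expected):
    """Build a scorer that returns the share of cells matching the reference answer."""
    def score(output):
        if not isinstance(output, list) or len(output) != len(expected):
            return 0.0                     # wrong shape counts as a failed case
        hits = sum(1 for got, want in zip(output, expected) if got == want)
        return hits / len(expected)
    return score


# Reference answer for the column [1, None, 2, 3] with the gap filled by the median (2).
case = BenchmarkCase(
    prompt="Fill missing values in the column with the column median.",
    skills=("missing_value_treatment",),
    evaluate=fraction_matching([1, 2, 2, 3]),
)
```

Because every case returns a scalar, scores from different agents on the same case are directly comparable.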

That last detail is useful but dangerous. LLM-based evaluation can expand coverage into fuzzy tasks, but it can also import model bias and inconsistency into the benchmark itself. The paper recognises benchmark design as an open challenge, and it should. A benchmark generated and judged partly by LLMs is still a benchmark, but it is not automatically a court of law. More like a tribunal with excellent grammar.

Task embeddings are used to select agents, not to make the system “semantic” in the vague sense

The paper’s task embedding method deserves attention because it addresses a subtle problem.

A naive embedding of a data science task may focus on irrelevant surface features: domain wording, dataset formatting, or representation style. Two tasks may look similar textually but require different operations; two others may look different but require the same reasoning procedure.

To reduce that mismatch, iDataScience aligns task embeddings with embeddings of correct solutions. The idea is that a good task representation should reflect the procedure needed to solve the task, not just the words used to describe it. The authors propose a contrastive learning setup: correct task-solution pairs are positives, while solutions involving mutually exclusive skills serve as negatives.
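The objective can be illustrated with an InfoNCE-style loss, a common choice for contrastive learning. Whether the paper uses exactly this loss is not stated here, so treat the function, the temperature value, and the toy vectors as assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(task, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one task embedding.

    Low when the task is closer to its correct solution than to the negatives.
    Vectors are assumed to be L2-normalised; the temperature is an arbitrary choice.
    """
    logits = [dot(task, positive) / temperature]
    logits += [dot(task, neg) / temperature for neg in negatives]
    # Numerically stable negative log-softmax of the positive logit.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# Toy 2-d embeddings: the task matches the first solution, not the second.
aligned = info_nce([1.0, 0.0], positive=[1.0, 0.0], negatives=[[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], positive=[0.0, 1.0], negatives=[[1.0, 0.0]])
```

Training would adjust the task encoder to push the loss down, which pulls each task toward its correct solution and away from solutions built on mutually exclusive skills.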

This is an implementation detail, but it carries a larger lesson. Enterprise agent selection should be based on operational similarity, not linguistic similarity alone.

That matters for procurement and deployment. Many organisations evaluate agents using charming demos. A user asks a natural-language question, the agent produces a plausible answer, and everyone nods until the invoice arrives. A better evaluation asks whether the agent can perform the required class of operations under realistic data constraints. The paper’s task embedding mechanism is a step toward that more serious evaluation posture.

Agent selection has three modes because production has three kinds of urgency

The paper gives three approaches for selecting heterogeneous data agents.

| Selection method | Paper’s stated trade-off | Operator reading |
|---|---|---|
| Adaptive benchmark aggregation | Medium accuracy, low online overhead, high offline overhead | Best when the organisation can afford upfront evaluation and wants fast routing later |
| Agent document analysis | Low accuracy, medium online overhead, low offline overhead | Useful for onboarding a new agent before full benchmarking is complete |
| Task sample experiment | High accuracy, high online overhead, low offline overhead | Useful when the task is expensive or risky enough to justify trial execution |

This table is not an experimental result. It is a design trade-off table. Its value is operational: it acknowledges that selection methods have different cost profiles.

A mature enterprise system would probably use all three. Benchmark aggregation for known agents. Document analysis for newly integrated tools. Sample experiments for high-cost workflows where choosing the wrong agent would be worse than spending extra time upfront.
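Combining the three might look like a small routing rule. The two boolean inputs and the order of the checks are assumptions about how an operator could encode the trade-offs; the paper does not prescribe this logic.

```python
def choose_selection_method(agent_is_benchmarked, task_is_high_stakes):
    """Pick one of the three selection methods.

    Hypothetical policy: pay for trial execution only when a wrong choice is
    costly, use benchmarks when they exist, fall back to documentation otherwise.
    """
    if task_is_high_stakes:
        return "task_sample_experiment"           # highest accuracy, highest online cost
    if agent_is_benchmarked:
        return "adaptive_benchmark_aggregation"   # offline cost already paid
    return "agent_document_analysis"              # cheap fallback for new agents
```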

The paper’s taxonomy therefore reads less like an academic flourish and more like a deployment checklist. Before asking an agent to act, decide how much confidence the decision requires and how much overhead the business can tolerate.

Pipeline planning is where “agentic” stops being a slogan

The iDataScience online stage decomposes a task into sub-tasks, assigns agents, executes them according to dependencies, and revises the plan when results fail validation.

This is the heart of the Data Agent concept. A complex data science request may involve cleaning, joining, feature construction, modelling, evaluation, and explanation. One agent may be strong at statistical modelling; another may be better at data wrangling; another may be designed for visualisation. The orchestrator must break the task into pieces that match these capabilities.

The paper proposes agent-oriented task planning. The LLM receives the task and a set of agent profiles, then generates sub-tasks aligned with available agents. The plan is checked for completeness and redundancy. Sub-tasks are represented in a dependency graph. Agent selection is applied to each sub-task. If no suitable agent is found, the sub-task can be decomposed further. If correlated sub-tasks can be handled together, they can be merged.
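The completeness and redundancy checks can be sketched over a plain dictionary of sub-tasks. The plan format, the capability names, and `check_plan` itself are hypothetical; only `graphlib.TopologicalSorter` is a real standard-library API.

```python
from graphlib import TopologicalSorter


def check_plan(required, subtasks):
    """Validate a plan before execution.

    required: set of capabilities the overall task needs.
    subtasks: {name: {"covers": set of capabilities, "needs": set of sub-task names}}
    Returns (missing capabilities, redundant sub-tasks, execution order).
    """
    covered = set()
    redundant = []
    for name, spec in subtasks.items():
        if spec["covers"] <= covered:
            redundant.append(name)        # adds nothing that is not already covered
        covered |= spec["covers"]
    missing = required - covered

    graph = {name: spec["needs"] for name, spec in subtasks.items()}
    # static_order() yields prerequisites first and raises CycleError on a cycle.
    order = list(TopologicalSorter(graph).static_order())
    return missing, redundant, order


PLAN = {
    "clean": {"covers": {"cleaning"}, "needs": set()},
    "features": {"covers": {"feature_derivation"}, "needs": {"clean"}},
    "model": {"covers": {"regression"}, "needs": {"features"}},
    "dedupe": {"covers": {"cleaning"}, "needs": set()},
}
REQUIRED = {"cleaning", "feature_derivation", "regression", "evaluation"}

missing, redundant, order = check_plan(REQUIRED, PLAN)
```

In this invented plan, `missing` flags that nothing covers evaluation and `redundant` flags the second cleaning step, which is the cue to decompose further or merge.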

The mechanism is worth spelling out because it is where the enterprise value lives. Static automation assumes the workflow is known. Data Agent orchestration assumes the workflow must be inferred.

That does not make the agent omniscient. It makes the planning layer adaptive. The difference is not semantic hair-splitting. It determines whether a system can handle only yesterday’s approved workflow or can construct tomorrow’s workflow from reusable capabilities.

Execution is not complete until failure is part of the design

The paper’s execution model includes parallel bottom-up execution along a dependency graph. A sub-task runs once the sub-tasks it depends on have completed, so independent branches can proceed in parallel. Intermediate results are checked by LLMs. If something fails, the system can refine the pipeline at two levels.

At the agent level, the system may rephrase an ambiguous sub-task, search for missing inputs, fix intermediate formatting, or switch to the next-best agent. At the global level, if local repair fails, the system replans the full pipeline while preserving already computed intermediate results in the data catalog.
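The two levels can be sketched in a few lines. This is a simplified illustration: agent-level repair is reduced to trying the next-best agent, the catalog is a plain dictionary, and `execute`, `replan`, and the toy agents are hypothetical names.

```python
def execute(order, ranked_agents, catalog, replan=None):
    """Run sub-tasks in order with local repair, then global replanning.

    order:         sub-task names in dependency order
    ranked_agents: {sub-task: [callable, ...]}, best agent first; each takes the catalog
    catalog:       dict of already computed intermediate results, kept across replans
    replan:        optional callable(failed_subtask, catalog) -> (new_order, new_agents)
    """
    for subtask in order:
        if subtask in catalog:
            continue                      # reuse an earlier intermediate result
        for agent in ranked_agents.get(subtask, []):
            try:
                catalog[subtask] = agent(catalog)
                break                     # local success
            except Exception:             # agent-level repair: try the next-best agent
                continue
        else:
            if replan is None:
                raise RuntimeError(f"sub-task {subtask!r} failed and no replanner is set")
            # Global repair: build a new plan but keep the catalog.
            new_order, new_agents = replan(subtask, catalog)
            # Only one replanning round is allowed here, to keep the sketch finite.
            return execute(new_order, new_agents, catalog, replan=None)
    return catalog


calls = []

def load(cat):
    calls.append("load")
    return [3, 1, 2]

def bad_sort(cat):
    raise ValueError("unsupported input")

def good_sort(cat):
    return sorted(cat["load"])

def broken_report(cat):
    raise RuntimeError("renderer unavailable")

def replan_without_report(failed, cat):
    # Replace the failed reporting step with a simpler summary step.
    return ["load", "sort", "summary"], {"summary": [lambda c: {"max": c["sort"][-1]}]}

result = execute(
    ["load", "sort", "report"],
    {"load": [load], "sort": [bad_sort, good_sort], "report": [broken_report]},
    {},
    replan=replan_without_report,
)
```

Here the sort step is repaired locally, the report step triggers a global replan, and the load and sort results are reused rather than recomputed.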

This is an important detail. Failure recovery is not treated as an afterthought. It is part of the architecture.

For business readers, this is one of the most practical ideas in the paper. In real data work, intermediate outputs are often valuable even when the full workflow fails. A human analyst keeps useful extracts, intermediate tables, partial joins, and diagnostic notes. The paper’s data catalog performs a similar role for the agent system: outputs become new datasets with metadata, reducing redundant computation during replanning.

This is also where governance becomes unavoidable. If every intermediate result becomes a reusable dataset, someone must care about lineage, permissions, expiry, quality, and privacy. Otherwise the organisation has not built a memory layer. It has built a haunted attic.
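A catalog entry that carries lineage and an expiry makes those questions answerable in code. The `CatalogEntry` fields and the freshness rule below are assumptions for illustration; the paper does not specify a metadata schema, and permissions and quality checks are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical metadata for one intermediate result."""
    name: str
    produced_by: str          # agent or tool that created it
    inputs: tuple             # lineage: names of upstream entries
    created_at: datetime
    ttl: timedelta            # how long the result may be reused

    def is_fresh(self, now):
        return now - self.created_at <= self.ttl


def reusable(entries, name, now):
    """Return the entry if it and everything upstream of it is present and fresh."""
    entry = entries.get(name)
    if entry is None or not entry.is_fresh(now):
        return None
    if any(reusable(entries, up, now) is None for up in entry.inputs):
        return None           # stale or missing lineage invalidates the result
    return entry


t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
ENTRIES = {
    "raw": CatalogEntry("raw", "loader", (), t0, timedelta(hours=1)),
    "joined": CatalogEntry("joined", "wrangler", ("raw",), t0, timedelta(hours=24)),
}

after_30_min = reusable(ENTRIES, "joined", t0 + timedelta(minutes=30))
after_2_hours = reusable(ENTRIES, "joined", t0 + timedelta(hours=2))
```

After two hours the joined table is still within its own expiry, but its upstream extract is not, so the sketch refuses to reuse it.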

Data analytics agents extend the idea from notebooks to semantic operators

The paper then generalises beyond iDataScience into data analytics agents. These include agents for unstructured data analytics, semantic structured data analytics, data lake analytics, and multi-modal analytics.

The common mechanism is semantic query decomposition. A natural-language query is translated into a pipeline of semantic operators: semantic filtering, semantic grouping, semantic sorting, semantic projection, semantic joining, and similar operations. Each logical semantic operator can map to multiple physical execution options, such as LLM calls, pre-programmed functions, or generated code.
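The logical-to-physical mapping can be sketched as one logical operator with several interchangeable implementations. Everything here is illustrative: the `PhysicalOp` record, the cost and accuracy numbers, and `fake_llm_filter`, which stands in for a model call without making one.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PhysicalOp:
    """One way to execute a logical semantic operator. Numbers are placeholders."""
    name: str
    run: Callable[[list], list]
    cost_per_row: float       # assumed relative cost
    est_accuracy: float       # assumed, e.g. from offline evaluation


def pick_physical(options, min_accuracy):
    """Cheapest option that meets the accuracy floor; the most accurate one otherwise."""
    ok = [op for op in options if op.est_accuracy >= min_accuracy]
    if ok:
        return min(ok, key=lambda op: op.cost_per_row)
    return max(options, key=lambda op: op.est_accuracy)


def keyword_filter(rows):
    return [r for r in rows if "refund" in r.lower()]


def fake_llm_filter(rows):
    # Stand-in for a model call; a real implementation would query an LLM.
    return [r for r in rows if any(w in r.lower() for w in ("refund", "money back"))]


# Physical options for the logical operator "semantic filter: refund requests".
SEMANTIC_FILTER = [
    PhysicalOp("keyword", keyword_filter, cost_per_row=0.001, est_accuracy=0.70),
    PhysicalOp("llm", fake_llm_filter, cost_per_row=1.0, est_accuracy=0.95),
]

rows = ["I want my money back", "Great product", "Refund please"]
cheap = pick_physical(SEMANTIC_FILTER, min_accuracy=0.6)
careful = pick_physical(SEMANTIC_FILTER, min_accuracy=0.9)
```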

This distinction between logical and physical operators is quietly important. It means the agent can reason at the level of user intent while still optimising execution cost and accuracy underneath.

For structured databases, the paper discusses semantic SQL: extending SQL-like processing with LLM-powered semantic operators. A query may require open-world knowledge or semantic extraction that traditional closed-world databases do not support. The agent can combine traditional operators with semantic operators, replacing some semantic calls with deterministic operations where possible and using embedding filters or smaller LLMs to reduce cost.
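The cost-reduction idea can be sketched as a cascade: a cheap scorer decides the easy rows and only the ambiguous ones reach the expensive check. The thresholds, the function names, and the toy scores are assumptions; in practice the cheap scorer might be an embedding similarity and the expensive check an LLM call.

```python
def cascade_filter(rows, cheap_score, expensive_check, low=0.2, high=0.8):
    """Filter rows with a cheap scorer first and an expensive check only when unsure.

    cheap_score(row) -> float in [0, 1], e.g. an embedding similarity (assumed).
    expensive_check(row) -> bool, e.g. an LLM call (assumed).
    Returns (kept rows, number of expensive calls made).
    """
    kept, expensive_calls = [], 0
    for row in rows:
        score = cheap_score(row)
        if score >= high:
            kept.append(row)              # confidently relevant
        elif score > low:
            expensive_calls += 1          # ambiguous: escalate
            if expensive_check(row):
                kept.append(row)
        # score <= low: confidently irrelevant, dropped without escalation
    return kept, expensive_calls


# Toy inputs: precomputed cheap scores and a stand-in for the expensive check.
TOY_SCORES = {"a": 0.9, "b": 0.5, "c": 0.1, "d": 0.4}
kept, calls_made = cascade_filter(
    ["a", "b", "c", "d"],
    cheap_score=TOY_SCORES.get,
    expensive_check=lambda row: row == "b",
)
```

Four rows are filtered with two expensive calls instead of four, at the price of trusting the cheap scorer at both extremes.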

This is exactly the kind of hybrid architecture enterprises should prefer. Use expensive semantic reasoning where it is necessary. Replace it with cheaper deterministic computation where it is not. The future of enterprise AI is not “LLMs everywhere”. It is “LLMs where the marginal semantic value exceeds the marginal chaos”.

The DBA Agent is about diagnosis, not replacing database administration

The DBA Agent example moves the architecture into database operations. The paper describes a system that extracts knowledge from diagnostic documents, retrieves relevant tools and prompts, performs root cause analysis using tree search, and optimises execution pipelines.
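Root cause analysis by tree search can be illustrated with a generic best-first search over hypotheses. The specific search strategy used by the DBA Agent is not described here, so the algorithm choice, the hypothesis tree, and the evidence scores below are all assumptions.

```python
import heapq


def diagnose(root, children, score, is_root_cause, budget=20):
    """Best-first search over a tree of diagnostic hypotheses.

    children(node) -> list of more specific hypotheses
    score(node) -> float, higher means better supported by the evidence
    is_root_cause(node) -> bool
    Returns the first confirmed root cause, or None if the budget runs out.
    """
    frontier = [(-score(root), 0, root)]   # max-heap via negated score
    counter = 1                            # tie-breaker so nodes are never compared
    while frontier and budget > 0:
        _, _, node = heapq.heappop(frontier)
        budget -= 1
        if is_root_cause(node):
            return node
        for child in children(node):
            heapq.heappush(frontier, (-score(child), counter, child))
            counter += 1
    return None


# Invented hypothesis tree and evidence scores for a slow-query incident.
TREE = {
    "slow queries": ["lock contention", "missing index"],
    "lock contention": ["long transaction"],
    "missing index": [],
    "long transaction": [],
}
EVIDENCE = {
    "slow queries": 0.5,
    "lock contention": 0.8,
    "missing index": 0.3,
    "long transaction": 0.9,
}

cause = diagnose(
    "slow queries",
    children=lambda n: TREE.get(n, []),
    score=EVIDENCE.get,
    is_root_cause=lambda n: n == "long transaction",
)
```

The budget parameter is the interesting part operationally: it bounds how much investigation the agent does before handing the ranked hypotheses to a human.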

The paper states that the DBA Agent significantly outperforms traditional methods and GPT-4 on previously unseen database anomalies. However, within this overview paper, that claim is not unpacked with detailed metrics. It should therefore be read as reported evidence from the authors’ related DBA work, not as the central empirical basis of this paper.

The practical direction is still clear. Database incidents are often time-sensitive, documentation-heavy, and diagnostic rather than creative. That makes them plausible candidates for agent assistance. A DBA Agent can search documentation, match symptoms, propose root causes, and generate structured reports.

But “assistance” is the key word. In production settings, diagnosis affects availability, revenue, and sometimes contractual obligations. The reasonable deployment path is human-supervised triage first: let the agent gather evidence, rank hypotheses, and draft remediation plans. Let a qualified operator approve actions until the system has earned a more dangerous level of trust.

Autonomous database surgery is not where one begins. Unless one enjoys postmortems.

What the paper directly shows, and what Cognaptus infers

The paper is broad, so interpretation requires separation between direct contribution and business inference.

| Layer | What the paper directly provides | Cognaptus business inference | What remains uncertain |
|---|---|---|---|
| Data Agent concept | A holistic architecture for perception, planning, tools, memory, learning, and multi-agent coordination | Data Agents can be understood as a control plane above fragmented data systems | Whether this architecture can be implemented reliably across messy enterprise environments |
| iDataScience | A detailed design for skill-based benchmarking, task embedding, agent selection, and adaptive pipeline orchestration | Enterprises should evaluate agents by task capability and route work dynamically | How well generated benchmarks and embeddings perform at scale |
| Analytics agents | Examples of semantic operators over structured, unstructured, lake, and multi-modal data | Semantic layers may evolve from passive metadata into active execution planning layers | Cost, latency, correctness, and governance for semantic operators remain hard |
| DBA Agent | A diagnostic agent design with reported superiority over traditional methods and GPT-4 on unseen anomalies | Operations teams can use agents for evidence gathering and root-cause analysis | The overview paper does not provide detailed metrics for independent evaluation |
| Challenges | Open issues in theoretical guarantees, reflection, benchmarks, privacy, scalability, and performance | Production deployment should start with bounded workflows and human approval | Full autonomy remains a research target, not a procurement checkbox |

This separation matters because the paper is ambitious. Ambition is useful in research and hazardous in vendor decks.

The safe reading is not that Data Agents are ready to take over enterprise data work. The safe reading is that data automation is moving from task execution toward task orchestration. The business opportunity is to build systems that know what they are doing with the tools they already have.

The business value is cheaper coordination, not cheaper headcount

The obvious but lazy interpretation is that Data Agents reduce staffing. That may happen in narrow workflows, but it is not the strongest argument.

The stronger argument is that Data Agents reduce coordination cost.

In enterprise data work, coordination cost appears everywhere: translating business questions into technical tasks, identifying relevant datasets, choosing execution tools, validating intermediate outputs, debugging failed pipelines, rerunning jobs, documenting decisions, and handing work between analysts, engineers, scientists, and operators.

A Data Agent architecture targets this coordination layer. If it works, it can make existing teams faster before it makes them smaller. It can also make specialist expertise more reusable. A DBA’s diagnostic playbook, a data scientist’s modelling workflow, and a data engineer’s cleaning logic can become components in an orchestrated system rather than private craft knowledge trapped in tickets and notebooks.

This is why the paper’s emphasis on benchmarking, agent profiles, and catalogs is commercially relevant. The agent is only as useful as its map of available capabilities.

For a business, the first serious implementation would likely look boring:

  • route analytics requests to the right tool or agent;
  • decompose recurring data science workflows into auditable sub-tasks;
  • generate and store intermediate datasets with metadata;
  • use semantic operators only where deterministic operators fail;
  • support DBAs with diagnosis reports rather than autonomous remediation;
  • evaluate agents by task class, not by generic demo quality.

Boring is good. Boring is how systems survive procurement, compliance, and Monday morning.

Where the architecture still bends under production pressure

The paper’s own challenge section is appropriately sober. It names theoretical guarantees, self-reflection and reward models, benchmarks, security and privacy, scalability, and performance.

These are not side issues. They define whether Data Agents become infrastructure or remain impressive diagrams.

Theoretical guarantees matter because LLMs and semantic operators can hallucinate. In analytics, a wrong answer may mislead strategy. In database operations, a wrong diagnosis may worsen an outage. The more autonomous the agent becomes, the more correctness stops being a nice-to-have.

Benchmarks matter because agent systems are heterogeneous. A single average score is not enough. The paper’s skill-based benchmark is a useful direction, but benchmark generation, evaluation consistency, and domain realism remain unresolved.

Privacy matters because Data Agents need context. They may access user requests, domain data, intermediate results, logs, and diagnostic documents. The more memory the system has, the more governance it needs.

Scalability matters because enterprise data is not a demo CSV. Large tables, multi-modal datasets, long-running model training jobs, and distributed engines introduce cost and latency constraints that cannot be hand-waved away by calling another LLM.

The practical boundary is therefore clear: deploy Data Agents first in bounded domains where inputs, tools, permissions, and success criteria can be constrained. Expand only after the organisation can monitor the agent’s plans, inspect intermediate outputs, control memory, and evaluate results against real operational standards.

The shift from ETL to orchestral intelligence

ETL was built for movement: get data from one place to another, transform it into shape, and load it into a system of record or analysis.

Data Agents are built for coordination: understand the task, identify the data, select the tools, compose the workflow, execute the plan, inspect the result, and adapt when reality refuses to behave.

That is the shift the paper captures. It is not a finished product category. It is a blueprint for where data systems are heading as LLMs become embedded in the control logic of enterprise infrastructure.

The most useful part of the paper is its mechanism-first view. It does not ask us to believe in autonomous agents as a personality type. It asks us to think about the missing orchestration layer between natural-language intent and heterogeneous data execution.

That layer will not replace databases, catalogs, pipelines, notebooks, or DBAs. It will sit above them, route between them, and gradually absorb the coordination work that currently lives in human judgement.

Which means the next phase of enterprise data automation may not be ETL with a chatbot taped to the front. It may be something closer to orchestral intelligence: a system that knows which instrument should play, when, and what to do when the trumpet comes in early.

Still, keep a conductor nearby.

Cognaptus: Automate the Present, Incubate the Future.


  1. Zhaoyan Sun, Jiayi Wang, Xinyang Zhao, Jiachi Wang, and Guoliang Li, “Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems,” arXiv:2507.01599, 2025, https://arxiv.org/abs/2507.01599