TL;DR for operators
Enterprise AI will not become useful merely because someone bolts a chatbot onto a database and calls the result an “agent”. That is theatre with API keys.
The paper behind this article proposes something more sober: a blueprint architecture for compound AI systems in the enterprise, where LLMs are important but not sovereign.[1] The core idea is that enterprise AI should be built as a distributed system, not as a heroic model prompt. Streams carry data and control messages. Registries expose existing APIs, models, and datasets as searchable assets. Task planners convert user intent into executable workflows. Data planners work out which databases, documents, models, or transformations are needed. Coordinators execute plans while tracking cost, latency, and quality budgets.
That matters because most enterprises already have useful machinery: databases, predictive models, search systems, ranking services, compliance rules, UI workflows, and operational APIs. The problem is not that these assets are worthless. The problem is that they are usually invisible to an LLM unless someone wires them in manually, repeatedly, and rather optimistically.
The paper’s HR case study, Agentic Employer, shows how the architecture can support recruiter-facing workflows where UI events and natural-language requests trigger agents for summarisation, natural-language-to-query conversion, SQL execution, and result explanation. This is useful as an implementation illustration. It is not a benchmark. There are no ROI numbers, no latency comparisons, and no production reliability study. So the business lesson is architectural, not evidential: if enterprises want agentic AI to become operational software, they need an asset map, an orchestration substrate, and budget-aware execution — not just more prompts wearing a tiny architect’s hat.
The old data flow does not disappear; it gets surrounded
A familiar enterprise scene: a team wants to add AI to an existing workflow. There is a database here, a search index there, a proprietary model somewhere behind an internal API, a few dashboards nobody wants to touch, and a compliance process that still believes “dynamic” is a dangerous word.
The quick solution is to build an LLM wrapper. The user asks a question, the model calls a tool, the tool hits a database, and the answer comes back in polite prose. Everyone applauds. Then the second use-case appears. And the third. And the fourth one needs a different database, a model ranking step, a user confirmation screen, a cost ceiling, an audit trail, and a fallback when the LLM invents a table name. The wrapper begins to resemble a ball of yarn after a cat-led transformation programme.
Kandogan et al.’s paper is aimed at that exact failure mode. Its argument is not that enterprises need “agents” in the vague, investor-friendly sense. It argues that enterprise compound AI needs a unifying architecture with explicit abstractions for integration, orchestration, coordination, and optimisation.
That is a different claim from “LLMs can use tools”. Tool use is a capability. Enterprise deployment is a systems problem. The paper’s contribution is to move the discussion from model behaviour to operating structure.
The practical question changes:
The question is not “Can the LLM call the right API?” It is “Can the enterprise make its APIs, data, models, plans, budgets, events, and outputs visible enough for many AI workflows to be built, monitored, reused, and controlled?”
That is a less glamorous question. Conveniently, it is also the one that matters.
The paper’s real unit of analysis is the compound AI system
The paper starts from a familiar observation: LLM adoption in enterprises is difficult because companies want to use proprietary data, existing models, internal APIs, and controlled workflows while still meeting requirements around cost, quality, latency, privacy, and reliability. LLMs alone do not solve this. They may even make the integration problem worse by hiding it behind fluent text.
The authors position their work within the shift from monolithic models to compound AI systems. In a compound AI system, the answer comes not from one model but from a coordinated set of components: retrievers, databases, specialised models, APIs, verifiers, planners, user interfaces, and LLMs. The paper’s complaint is that current progress has been piecemeal. There are agent frameworks, tool-calling techniques, RAG systems, workflow optimisers, and programming models, but not yet a clear enterprise architecture that ties them together.
So the blueprint introduces the following components:
| Component | Mechanism | Operational consequence |
|---|---|---|
| Streams | Carry data and control messages among components | Orchestration becomes explicit, observable, and event-driven |
| Agents | Wrap models, APIs, tools, services, and UI-facing components | Existing enterprise assets become callable compute units |
| Agent registry | Stores metadata, descriptions, inputs, outputs, deployment details, and learned representations of agents | Planners can discover and select capabilities instead of relying on hard-coded tool lists |
| Data registry | Stores metadata about enterprise data sources, schemas, modalities, indices, and learned representations | AI workflows can search and reason over available data assets |
| Sessions | Scope context, streams, agent participation, and outputs | Multi-step workflows become traceable within a bounded interaction |
| Task planner | Converts user intent into a workflow, often represented as a DAG of agent invocations | Natural-language requests become structured work plans |
| Data planner | Decomposes retrieval and transformation needs across databases, documents, graphs, and models | “Ask the data” becomes a query-planning problem, not just NL2SQL with bravado |
| Task coordinator | Executes plans, invokes agents, monitors progress, and tracks budgets | Workflows can be managed against cost, latency, and quality constraints |
| Optimiser | Chooses among plans or operators under multiple objectives | The system can reason about trade-offs rather than simply execute the first plausible chain |
This is the mechanism-first reading of the paper: its value is not any single component. It is the way the components create an enterprise operating model for agentic AI.
Streams make orchestration observable instead of mystical
The central abstraction is the stream.
In the blueprint, a stream is a sequence of messages. Those messages may contain data, such as user text, query outputs, or model responses. They may also contain control instructions, such as a signal to invoke a SQL executor or a summarisation agent. Components can subscribe to streams, consume messages, emit new messages, and trigger downstream work.
That may sound like standard event-driven architecture, and in part it is. The important move is that streams are treated as first-class data resources for AI orchestration. They do not merely move payloads. They externalise the flow of work.
This matters because many agent systems are narratively dynamic but operationally opaque. The model “decides” what to call next. The agent “reasons” about a tool. A chain “emerges”. Very poetic. Less pleasant when the CFO asks why one user query invoked four expensive models, two databases, and a hallucinated summariser with the confidence of a weather app in typhoon season.
Streams give the system a place to represent what happened. A user message enters one stream. A planner emits a plan into another. A coordinator emits control instructions. Agents consume inputs and write outputs. Tags can determine which agents listen. UI events can be processed through the same mechanism as conversational text.
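To make the idea concrete, here is a minimal in-memory sketch of a tagged stream. It is not the paper's implementation: the `Message` and `Stream` classes, the tag names, and the `planner` callback are all hypothetical, and a real deployment would sit on a durable message broker rather than a Python list.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Message:
    kind: str                       # "data" or "control"
    payload: object
    tags: frozenset = frozenset()


@dataclass
class Stream:
    name: str
    log: list = field(default_factory=list)          # append-only history, kept for audit
    subscribers: list = field(default_factory=list)  # (required_tags, callback) pairs

    def subscribe(self, callback: Callable[[Message], None], tags=()):
        self.subscribers.append((frozenset(tags), callback))

    def emit(self, message: Message):
        self.log.append(message)
        for required, callback in list(self.subscribers):
            if required <= message.tags:  # deliver only when all required tags are present
                callback(message)


user_stream = Stream("user")
plan_stream = Stream("plans")


def planner(message: Message) -> None:
    # Turn a tagged user request into a control message on a second stream.
    plan_stream.emit(
        Message("control", {"invoke": "summariser", "input": message.payload}, frozenset({"plan"}))
    )


user_stream.subscribe(planner, tags=["nlq"])
user_stream.emit(Message("data", "summarise applicants for job 42", frozenset({"nlq"})))
user_stream.emit(Message("data", {"clicked": "job-42"}, frozenset({"ui"})))

print(len(user_stream.log), len(plan_stream.log))
```

Both the conversational message and the UI event land in the same log, but only the one tagged `nlq` wakes the planner. The log is the point: it is the record an operator can replay when someone asks what a request actually triggered.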
The business value is not that streams are fashionable. The business value is auditability and control. If data and control exchanges are explicit, the organisation can monitor flows, debug failures, apply governance rules, and understand how a user request moved through the system.
The old enterprise data flow does not vanish. Batch jobs, APIs, databases, and services still exist. The blueprint wraps them in an orchestration layer that can support more adaptive workflows without pretending that adaptation is magic.
Agents are wrappers for enterprise capability, not digital employees
A common misconception is that an enterprise agent must be an LLM-powered persona. The paper uses a broader and more useful definition.
An agent is any compute entity that processes input and generates output. That can be an LLM, but it can also be a traditional API, a CRF model, a search interface, a ranking model, a SQL executor, a summariser, a profile collector, or a UI component. The agent abstraction is a wrapper around capability.
This is quietly important. Most enterprises do not suffer from a shortage of capabilities. They suffer from fragmented capability exposure. A fraud model exists but is not easily discoverable by a customer-support workflow. A matching algorithm exists but is embedded inside one product. A document database exists but is poorly described. An internal API exists but only three engineers know its real behaviour, and one has left to become a consultant, as tradition demands.
The agent abstraction gives these assets a standard interface: inputs, outputs, triggering rules, processor logic, deployment configuration, and metadata. Agents may be triggered centrally by instructions or more autonomously by monitoring streams for tags. The paper even draws inspiration from Petri nets to describe how agents can wait for required inputs from multiple streams before executing.
That last point is more than academic decoration. In real workflows, an agent may need several inputs: a user profile, a job listing, a ranking criterion, and a policy constraint. If those arrive separately, the system needs a disciplined mechanism for pairing them and triggering work only when the necessary tokens are available. Otherwise, agentic workflows become race conditions with a user interface.
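The Petri-net idea can be sketched in a few lines. The `JoinAgent` class below is an illustration under simplifying assumptions, not the paper's design: each required input is a queue, and the agent fires only when every queue holds a token. Tokens are paired in arrival order here; a production system would more plausibly pair them by session or correlation ID.

```python
from collections import deque


class JoinAgent:
    """Fires only when every required input holds at least one token."""

    def __init__(self, name, required_inputs, processor):
        self.name = name
        self.places = {key: deque() for key in required_inputs}
        self.processor = processor
        self.outputs = []

    def offer(self, input_name, token):
        # A token arrives on one input; check whether the agent can now fire.
        self.places[input_name].append(token)
        self._fire_if_ready()

    def _fire_if_ready(self):
        while all(self.places.values()):  # an empty deque is falsy
            inputs = {key: queue.popleft() for key, queue in self.places.items()}
            self.outputs.append(self.processor(**inputs))


matcher = JoinAgent(
    "job_matcher",
    ["profile", "job", "policy"],
    lambda profile, job, policy: f"{profile} vs {job} under {policy}",
)

matcher.offer("profile", "profile-7")
matcher.offer("policy", "policy-eu")
fired_early = bool(matcher.outputs)   # still waiting for a job token
matcher.offer("job", "job-42")
print(fired_early, matcher.outputs)
```

The agent stays idle after two of its three inputs arrive and fires exactly once when the third lands, which is the behaviour that keeps partial inputs from triggering partial work.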
The operational consequence is simple: AI integration becomes less about one-off glue code and more about standardising how enterprise capabilities are exposed.
Registries turn hidden assets into searchable infrastructure
If streams are the circulatory system, registries are the memory of what the system can actually do.
The agent registry stores descriptions, input and output parameters, deployment details, stream rules, Docker images, configurations, and other properties. It supports search and query over agent metadata. The paper also suggests learned representations derived from metadata and logs, so that usage history can improve discovery.
The data registry plays an analogous role for enterprise data. It stores metadata across different levels: lakehouse, lake, source system, database, table, schema, index, document collection, graph database, key-value store, and so on. It may include embeddings derived from schema details, data contents, structure, and query histories.
This is where the paper becomes most relevant for business practice. A great deal of enterprise AI failure is actually catalogue failure. The model cannot use what the organisation cannot describe. The planner cannot select an asset it cannot discover. The coordinator cannot enforce policy around a dataset whose permissions and lineage are undocumented. The governance team cannot audit a workflow that was assembled from invisible assumptions and Slack archaeology.
Registries do not make AI intelligent by themselves. They make enterprise assets legible to the AI system. That is the precondition for reuse.
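A toy agent registry shows what "legible" means in practice. The sketch below assumes a hypothetical `AgentRecord` schema and ranks agents by plain keyword overlap; the paper points towards learned representations for discovery, so treat the scoring here as a stand-in, not a recommendation.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRecord:
    name: str
    description: str
    inputs: tuple
    outputs: tuple
    properties: dict = field(default_factory=dict)  # e.g. image, deployment, stream rules


class AgentRegistry:
    def __init__(self):
        self._records = {}

    def register(self, record: AgentRecord) -> None:
        self._records[record.name] = record

    def search(self, query: str, produces=None) -> list:
        """Rank agents by keyword overlap with their description."""
        terms = set(query.lower().split())
        scored = []
        for record in self._records.values():
            if produces is not None and produces not in record.outputs:
                continue  # filter on declared outputs before scoring
            overlap = len(terms & set(record.description.lower().split()))
            if overlap:
                scored.append((overlap, record.name))
        # Highest overlap first; ties broken alphabetically for stable results.
        return [name for _, name in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


registry = AgentRegistry()
registry.register(AgentRecord(
    "summariser", "summarise applicant records for a job", ("rows",), ("summary",)))
registry.register(AgentRecord(
    "sql_executor", "execute sql query against the applicant database", ("sql",), ("rows",)))
registry.register(AgentRecord(
    "job_matcher", "match an applicant profile against job listings", ("profile",), ("ranking",)))

print(registry.search("summarise applicant"))
print(registry.search("applicant", produces="rows"))
```

Even this crude version changes the integration question. A planner no longer needs a hard-coded tool list; it needs descriptions good enough to be found.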
This also shifts the ROI logic. Buying a stronger model may improve local performance. Building registries improves the organisation’s ability to reuse capabilities across many workflows. One is a model upgrade. The other is infrastructure leverage.
Task planning converts intent into work, but not by trusting the LLM with everything
The task planner listens to user input and creates a workflow that available agents can execute. In the paper’s running example, a user says: “I am looking for a data scientist position in SF bay area.” The planner may decompose that into gathering background information, matching the user profile against jobs, and presenting results. These actions map to agents such as Profiler, Job Matcher, and Presenter.
The plan is represented as a directed acyclic graph connecting agent inputs and outputs. The planner uses metadata from the agent registry to identify suitable agents. Planning can also be interactive, dynamic, and incremental, adapting as feedback arrives.
This is not the same as letting an LLM improvise a tool chain from a prompt. The blueprint deliberately places the LLM inside a larger architecture. The authors are explicit that LLMs should not necessarily handle everything from planning to invoking tools to fetching data. The system should include dedicated components with designed logic and controls.
That distinction matters. Planning in enterprise workflows is not merely a reasoning task. It is a policy task, a cost task, a latency task, a data-access task, and a coordination task. A plan may be semantically plausible but operationally unacceptable. For example, it may use the wrong data source, exceed budget, violate privacy boundaries, or invoke a model whose accuracy is unsuitable for the decision context.
So the planner is not valuable because it makes the workflow “agentic”. It is valuable because it turns a natural-language request into a structured object that can be inspected, executed, revised, and optimised.
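What "inspectable" buys you can be shown with the running example. The sketch below represents the Profiler, Job Matcher, and Presenter plan as a dependency mapping and checks it before execution, using Python's standard `graphlib` module. The `order_plan` helper and the lower-case agent names are assumptions made for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def order_plan(plan, registered_agents):
    """Return an execution order, or raise ValueError if the plan cannot run."""
    mentioned = set(plan).union(*plan.values())
    unknown = mentioned - set(registered_agents)
    if unknown:
        raise ValueError(f"plan references unregistered agents: {sorted(unknown)}")
    try:
        return list(TopologicalSorter(plan).static_order())
    except CycleError as error:
        raise ValueError("plan is not acyclic") from error


# Each key is an agent; each value is the set of agents whose output it needs.
plan = {
    "profiler": set(),
    "job_matcher": {"profiler"},
    "presenter": {"job_matcher"},
}
registered = {"profiler", "job_matcher", "presenter"}

order = order_plan(plan, registered)
print(order)
```

Because the plan is data, it can be rejected before anything runs: an agent the registry has never heard of, or a cycle, fails validation instead of failing in production.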
Data planning is where the old NL2SQL fantasy grows up
The data planner may be the most underrated component in the paper.
A naive enterprise AI design treats data access as a translation problem: take a natural-language question and convert it into SQL. Sometimes that works. Often it does not. Enterprise data is messy, distributed, multi-modal, inconsistently named, and decorated with historical compromises that nobody would admit to in daylight.
The paper gives a simple example. A user asks for a “data scientist position in SF bay area.” A direct database query may fail because “SF bay area” is not a city value in the database. The system may need an LLM or other knowledge source to expand the region into a list of relevant cities. Similarly, “data scientist” may require a title taxonomy in a graph database to capture related roles. The query must be decomposed into sub-tasks: identify cities, identify job-title variants, then apply a selection over the jobs table.
This is the key move: natural-language data access becomes query planning over heterogeneous sources, not a single translation step.
The paper suggests that data planners may need operators beyond established relational operators. They may need to discover sources, extract text, summarise content, compare entities, transform natural-language criteria, or use an LLM as a data source for certain sub-tasks. The planner must also choose among sources and operator configurations under constraints such as cost, performance, and quality.
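The "SF bay area" example can be worked through with an in-memory SQLite table. Everything below is illustrative: the two lookup dictionaries stand in for an LLM call and a title taxonomy, the city and title lists are made up for the demo, and the `jobs` schema is an assumption.

```python
import sqlite3

# Hypothetical stand-ins: in the paper's example these would come from an LLM
# (region -> cities) and a title taxonomy in a graph database (title -> variants).
REGION_TO_CITIES = {"sf bay area": ["San Francisco", "Oakland", "San Jose"]}
TITLE_VARIANTS = {"data scientist": ["Data Scientist", "ML Scientist"]}


def expand_region(region):
    return REGION_TO_CITIES.get(region.lower(), [region])


def expand_title(title):
    return TITLE_VARIANTS.get(title.lower(), [title])


def find_jobs(conn, title, region):
    # Sub-tasks 1 and 2: resolve the fuzzy terms. Sub-task 3: one relational selection.
    cities = expand_region(region)
    titles = expand_title(title)
    sql = (
        "SELECT id FROM jobs "
        f"WHERE city IN ({','.join('?' * len(cities))}) "
        f"AND title IN ({','.join('?' * len(titles))}) ORDER BY id"
    )
    return [row[0] for row in conn.execute(sql, cities + titles)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, title TEXT, city TEXT)")
conn.executemany("INSERT INTO jobs VALUES (?, ?, ?)", [
    (1, "Data Scientist", "Oakland"),
    (2, "ML Scientist", "San Jose"),
    (3, "Data Scientist", "Seattle"),
    (4, "Accountant", "San Francisco"),
])

naive = conn.execute(
    "SELECT id FROM jobs WHERE city = ? AND title = ?", ("SF bay area", "data scientist")
).fetchall()
planned = find_jobs(conn, "data scientist", "SF bay area")
print(naive, planned)
```

The literal translation returns nothing, because no row has "SF bay area" as its city. The decomposed plan finds the two relevant jobs. The SQL was never the hard part.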
For business readers, this is the part worth slowing down for. Many enterprise AI projects underestimate data planning because demos are built on clean examples. Production systems are built on everything else.
The data planner is the architectural answer to a stubborn fact: the meaning of a business question often spans databases, taxonomies, documents, models, and unstated organisational knowledge. A model can help interpret the question. It cannot magically turn bad metadata into a coherent data estate. Annoying, yes. Also true.
The coordinator makes budgets operational
The task planner decides what should happen. The task coordinator handles execution.
It receives a plan, emits control messages to invoke agents, monitors workflow progress, maps outputs from one agent into inputs for another, invokes the data planner when transformations are needed, and tracks budgets. Those budgets may include cost, execution time, and quality measures such as accuracy.
This is where the paper’s architecture becomes recognisably enterprise-grade. A workflow that cannot track cost and latency is a prototype, not a service. A workflow that cannot abort, replan, or ask for confirmation when it exceeds thresholds is not autonomous; it is merely unsupervised.
The coordinator also separates planning from execution. That separation is valuable because it creates intervention points. A plan can be generated, inspected, estimated, executed, monitored, and revised. The organisation can define where humans must approve budget violations or where the system may replan automatically.
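A coordinator's budget check can be sketched as a loop over plan steps. The version below assumes each step declares an estimated cost and duration up front, which the paper itself flags as an open problem, and the `Step` fields, agent stubs, and figures are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    agent: str
    run: Callable[[object], object]  # takes the previous output, returns the next
    est_cost: float                  # assumed to be known in advance
    est_seconds: float


def run_plan(steps, max_cost, max_seconds):
    spent_cost = spent_seconds = 0.0
    value, trace = None, []
    for step in steps:
        over_cost = spent_cost + step.est_cost > max_cost
        over_time = spent_seconds + step.est_seconds > max_seconds
        if over_cost or over_time:
            # Stop before the step runs and hand control back for confirmation or replanning.
            return {"status": "needs_confirmation", "blocked_step": step.agent, "trace": trace}
        value = step.run(value)
        spent_cost += step.est_cost
        spent_seconds += step.est_seconds
        trace.append((step.agent, spent_cost, spent_seconds))
    return {"status": "completed", "output": value, "trace": trace}


steps = [
    Step("nl2q", lambda _: "SELECT name FROM applicants", 0.02, 1.0),
    Step("query_executor", lambda sql: [("Ada",), ("Lin",)], 0.00, 0.5),
    Step("query_summariser", lambda rows: f"{len(rows)} applicants matched", 0.10, 2.0),
]

tight = run_plan(steps, max_cost=0.05, max_seconds=10.0)
loose = run_plan(steps, max_cost=1.00, max_seconds=10.0)
print(tight["status"], tight["blocked_step"], loose["output"])
```

With the tight budget, the run pauses before the summariser rather than discovering the overspend afterwards. That pause is the intervention point: a human can approve it, or a planner can substitute something cheaper.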
In business terms, the coordinator is what turns “agentic workflow” into controlled operations. Without it, an AI system may still work on a demo day. It just has no adult supervision.
The HR case study illustrates the blueprint; it does not prove the economics
The paper’s case study is an HR application called Agentic Employer. It allows employers to interact with applicants through a graphical interface and conversational input. Users can select jobs, ask questions, generate summaries, and receive visual outputs.
The implementation shows two useful flows.
First, a UI-driven flow: a user selects a job ID in a form. That event is emitted into a stream. The Agentic Employer agent receives it, emits the job ID into another stream, and creates a plan to invoke a summariser. The task coordinator listens for the plan, unrolls it, emits a control message, and the summariser generates the output.
Second, a conversation-driven flow: a user enters text into the conversation. An intent classifier identifies the request as an open-ended query. The Agentic Employer agent tags the query as a natural-language query. An NL2Q agent converts it into a suitable database query, such as SQL. A query executor runs it, and a query summariser explains the result.
The important point is not that this HR workflow is revolutionary. Recruiters filtering applicants is not exactly the moon landing. The point is that UI events and natural-language requests can be handled through the same stream-based orchestration model.
That is a meaningful architectural demonstration. It shows the blueprint can support mixed interaction modes: clicks, forms, text, plans, SQL, summaries, and visual outputs. But the paper is careful about scope. Full details of the application are outside the paper, and the case study is used to validate and showcase the proposal, not to provide a quantitative evaluation.
So we should classify the paper’s artefacts correctly:
| Paper element | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Blueprint architecture overview | Main architectural contribution | Shows how streams, registries, planners, coordinators, and budgets fit together | Does not prove performance superiority |
| Deployment diagram | Implementation detail | Shows how components could run across enterprise clusters and containers | Does not establish scalability under production load |
| Agent, registry, task plan, and data plan figures | Mechanism explanation | Clarifies how components interface and trigger work | Does not validate correctness across domains |
| Agentic Employer case | Implementation illustration | Shows the architecture can express HR workflows using UI and conversational inputs | Does not prove ROI, reliability, latency, or accuracy improvements |
| Discussion of open research questions | Boundary and research agenda | Identifies unresolved problems in agents, data, planning, optimisation, and reliability | Does not solve those problems |
This matters because architecture papers are easy to over-sell. A blueprint is not a building. A case study is not an occupancy permit.
The business value is reuse, observability, and optimisation — in that order
The obvious business story would be: “This architecture helps enterprises adopt AI.” True, but bland enough to be used as a slide title by someone who says “leverage synergies” without flinching.
The more useful business story has three layers.
First, reuse. By wrapping existing APIs, models, and services as agents, the organisation avoids rebuilding every workflow from scratch. A matching model, summariser, SQL executor, or ranking service can become a registered component discoverable by planners. This is where infrastructure investment compounds.
Second, observability. By representing data and control flows through streams and sessions, the system gives operators a way to see what happened. This supports debugging, monitoring, auditing, and governance. In regulated or high-stakes settings, observability is not a nice extra. It is the difference between software and an expensive liability generator.
Third, optimisation. Once plans, components, metadata, and budgets are explicit, the organisation can start optimising across cost, latency, and quality. For example, a system may choose a cheaper model for low-risk summarisation, a specialised model for ranking, or a human approval step when uncertainty is high. The paper does not provide a finished optimiser, but it places optimisation in the architecture where it belongs.
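The simplest form of that trade-off is constrained selection: discard options that miss the quality or latency bar, then take the cheapest survivor. The candidates and their numbers below are invented for illustration; the paper does not prescribe an optimisation method, and a real optimiser would have to estimate these values under uncertainty.

```python
# Hypothetical options for a single step; none of these figures come from the paper.
candidates = [
    {"name": "small_llm", "cost": 0.01, "latency_s": 0.8, "quality": 0.82},
    {"name": "large_llm", "cost": 0.12, "latency_s": 3.5, "quality": 0.93},
    {"name": "specialised_ranker", "cost": 0.04, "latency_s": 1.2, "quality": 0.90},
]


def choose_option(options, min_quality, max_latency_s):
    """Cheapest option that meets both constraints, or None if nothing qualifies."""
    feasible = [
        option for option in options
        if option["quality"] >= min_quality and option["latency_s"] <= max_latency_s
    ]
    if not feasible:
        return None  # caller can relax constraints or route to human approval
    return min(feasible, key=lambda option: option["cost"])


low_risk = choose_option(candidates, min_quality=0.80, max_latency_s=2.0)
ranking = choose_option(candidates, min_quality=0.90, max_latency_s=2.0)
too_strict = choose_option(candidates, min_quality=0.95, max_latency_s=2.0)
print(low_risk["name"], ranking["name"], too_strict)
```

The three calls mirror the three cases in the text: a cheap model for low-risk work, a specialised model when the quality bar rises, and no feasible option at all, which is where a human approval step earns its place.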
This ordering matters. Many teams want optimisation before they have reuse or observability. That is like tuning an engine that has not yet been assembled. Admirable enthusiasm. Limited effect.
The governance angle is hidden inside the architecture
The paper does not present itself as a governance paper, but governance is baked into its architecture.
A data registry can hold metadata about schemas, indices, source systems, and potentially access constraints. An agent registry can expose deployment details, input-output expectations, and triggering rules. Sessions scope context. Streams record data and control exchanges. Coordinators monitor budgets and can abort or replan. Planners can be constrained by metadata and policies.
These mechanisms matter because enterprise AI governance cannot depend entirely on after-the-fact review. If an agentic workflow dynamically chains tools, accesses data, transforms outputs, and generates user-facing decisions, governance needs to exist at the orchestration layer.
The paper’s discussion section surfaces unresolved governance-adjacent questions directly. How should data privacy be supported when agents have different privileges? How can planners add verification and constraints? How can systems handle uncertainty from LLMs? How should user experience support debugging and mitigation of agent errors?
The blueprint does not answer all of this. It does something more preliminary: it creates places in the architecture where those answers could live.
That is less satisfying than a complete solution. It is also more credible.
Where this architecture applies — and where it may be too much
This blueprint is most relevant when three conditions hold.
First, the enterprise has many existing assets: databases, APIs, specialised models, search systems, document stores, and workflows. The more fragmented the asset base, the more valuable registries become.
Second, the use-cases require multi-step coordination rather than single-turn answers. If all you need is a simple FAQ bot over a narrow knowledge base, this architecture may be excessive. Please do not bring a distributed compound AI blueprint to a toaster.
Third, the organisation cares about cost, latency, reliability, and governance. If these constraints are real, planning and coordination cannot be left as prompt engineering folklore.
The architecture is less compelling for small teams with limited data sources, simple workflows, or low-risk use-cases where a direct application pattern is enough. It also introduces its own complexity. Registries must be maintained. Metadata quality matters. Planners need validation. Streams require infrastructure. Optimisers need cost models. Coordinators need policy rules. The architecture makes complexity explicit; it does not make complexity vanish.
This is a feature, not a bug, but only for organisations mature enough to use it.
The unresolved questions are not footnotes; they are the implementation backlog
The paper is unusually clear that much remains to be designed.
For agents, the open questions include how agents adapt to new tasks, learn from feedback, generate useful user interaction components, and produce learned representations from historical performance.
For data, the open questions include how to represent data across granularity levels, modalities, schemas, query histories, graphs, documents, and business context. The paper also points to the need for new data operators for text, graphs, and multi-modal data. This is a large research agenda hiding inside a tidy phrase.
For planning, the paper acknowledges that LLMs alone still cannot solve planning. The challenge is to exploit LLMs where useful while adding verification, constraints, multi-modal data awareness, collaborative refinement, and feedback attribution.
For optimisation, the hard question is cost estimation under uncertainty. How expensive is a plan before it runs? How reliable is a model on this specific data slice? How should accrued budget influence replanning? How should the system trade a slower specialised model against a faster general one?
For reliability, agentic workflows add nondeterminism to distributed systems. Traditional fault tolerance assumes certain kinds of failure. Agent workflows introduce others: reasoning failure, tool misuse, cascading error, weak intermediate outputs, and ambiguous responsibility. The architecture can host reliability mechanisms, but the mechanisms still need to be built.
This is the practical boundary. The blueprint is a strong conceptual frame. It is not a shrink-wrapped platform.
What operators should take from the paper
The useful takeaway is not “adopt this exact architecture tomorrow”. The useful takeaway is a diagnostic lens.
Ask whether your enterprise AI system has clear answers to the following:
| Diagnostic question | Why it matters |
|---|---|
| Are internal models, APIs, tools, and services registered as reusable capabilities? | Without this, every AI workflow becomes bespoke integration work |
| Are data sources described at the level needed for planning, discovery, access, and governance? | Without this, natural-language data access will fail outside clean demos |
| Are data and control flows observable as explicit events or streams? | Without this, debugging and auditability collapse |
| Can user intent be converted into an inspectable task plan? | Without this, “agentic” behaviour remains opaque |
| Can data retrieval be decomposed across sources and modalities? | Without this, NL2SQL becomes a brittle bottleneck |
| Does execution track cost, latency, and quality budgets? | Without this, production use becomes financially and operationally unpredictable |
| Can the system abort, replan, or request human confirmation? | Without this, autonomy becomes risk transfer disguised as innovation |
If the answer to most of these questions is no, the problem is probably not the model. It is the operating architecture around the model.
That is the paper’s quiet provocation. Enterprises do not need to choose between old data flows and new AI agents. They need an architecture that lets both be represented, connected, monitored, and optimised. Otherwise, agentic AI becomes another layer of accidental complexity on top of the old one. Progress, in the traditional enterprise sense: now with extra dashboards.
Conclusion: compound AI needs plumbing, not theatre
The paper’s best insight is that enterprise AI should be treated as a systems architecture problem. LLMs remain useful, but they are not the whole system. In the proposed blueprint, the important work happens in the plumbing: streams for orchestration, registries for discoverability, planners for decomposition, coordinators for execution, and budgets for operational discipline.
This reframes enterprise AI adoption. The question is not whether a model can answer a question in a demo. The question is whether the organisation can expose its proprietary assets in a way that many AI workflows can discover, combine, govern, and reuse.
That is less dramatic than the usual agentic AI storyline. No swarm of digital workers. No autonomous intern army. No synthetic middle manager explaining why the SQL query failed.
Just architecture.
Which, in enterprise software, is often where the real intelligence has been hiding all along.
Cognaptus: Automate the Present, Incubate the Future.
[1] Eser Kandogan, Nikita Bhutani, Dan Zhang, Rafael Li Chen, Sairam Gurajada, and Estevam Hruschka, “Orchestrating Agents and Data for Enterprise: A Blueprint Architecture for Compound AI,” arXiv:2504.08148, 2025. https://arxiv.org/abs/2504.08148