Assembly lines are not exciting because every worker improvises.
They are useful because each station does a narrow job, hands the result forward, and leaves as little room as possible for charming chaos. That is also the quiet lesson in “A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows” [1]. The paper looks, at first glance, like another guide to agents, tools, MCP servers, multi-model reasoning, and cloud-native deployment. The tempting summary would be: “Here are nine best practices for building agentic AI.”
That summary is accurate, but too flat.
The more interesting argument is mechanical: production-grade agentic AI improves when the system deliberately reduces the number of moments where the LLM must guess. The authors do not argue for agentic complexity as a virtue. They argue for a controlled division of labor: keep language judgment where language judgment creates value, and move infrastructure behavior into ordinary deterministic software.
Yes, the paper discusses agents. Yes, it uses MCP. Yes, it demonstrates a multi-agent multimodal workflow. But the most practical message is almost anti-theatrical: if the task is committing files, creating pull requests, generating timestamps, calling APIs, storing prompts, routing requests, or deploying services, do not ask the model to perform interpretive dance. Use code.
The production problem is not intelligence; it is uncontrolled discretion
Most agent demos succeed by collapsing several jobs into one impressive chain. The model searches, reads, reasons, writes, calls tools, fixes errors, and publishes output. In a demo, this feels magical. In production, it becomes a debugging swamp with a logo.
The paper frames the problem through a familiar enterprise transition: prototypes are easy, but production systems require reliability, observability, maintainability, governance, and reproducibility. In agentic systems, those goals are harder because LLMs add probabilistic behavior inside workflows that otherwise need operational discipline.
The authors’ practical response is not “build a smarter agent.” It is closer to “remove unnecessary choices from the agent.”
That gives us the core mechanism:
| Workflow decision | Bad production instinct | Better mechanism in the paper |
|---|---|---|
| External system access | Let the agent interpret broad MCP/tool metadata | Use direct tool calls or pure functions when possible |
| Infrastructure actions | Ask the LLM to create commits, files, pull requests, or API writes | Put these steps in deterministic orchestration code |
| Agent scope | Give one agent many tools and responsibilities | Use single-tool and single-responsibility agents |
| Prompt management | Hide prompts inside source code | Store prompts externally and load them at runtime |
| Model reliability | Trust one model’s answer | Use multiple model drafts and a reasoning consolidator |
| Workflow exposure | Embed workflow logic inside the MCP server | Keep MCP as a thin adapter over a REST-backed workflow |
| Deployment | Run scripts manually or semi-manually | Containerize and deploy through Docker/Kubernetes |
The principle underneath the table is simple: uncertainty should be budgeted. Spend it where semantic interpretation is required. Do not spend it on plumbing.
The case study is a media pipeline, but the mechanism is enterprise automation
The paper demonstrates its guidance through a podcast-generation workflow. The domain sounds lightweight: collect news, generate a podcast script, produce audio and video artifacts, and publish the result. But architecturally, this is a useful stress test because it combines retrieval, filtering, scraping, multi-model generation, reasoning consolidation, multimodal output, and software publishing.
The workflow starts with user-provided topics and source URLs. A search agent collects recent updates, a filtering agent selects relevant articles, and a scraping agent converts selected pages into clean Markdown. That Markdown becomes the grounding material for the generation stage.
Then the system calls multiple podcast script generation agents, each powered by a different LLM family. The paper’s evaluation section describes a consortium composed of Llama, OpenAI, and Gemini agents. The authors observe that these models produce different styles: Llama tends to be concise and structured, OpenAI more detailed and narrative-driven, and Gemini more stylistically flowing and contextual. That diversity is useful, but it also introduces emphasis drift and factual variation.
So the workflow adds a reasoning agent. Its job is not to create from nothing. It compares the drafts, resolves inconsistencies, removes speculative claims, and produces a consolidated final script. After that, other components convert the script into audio/video instructions, including Veo-3 JSON, text-to-speech output, video generation, and GitHub publishing.
This is where the case study becomes more than “AI makes a podcast.” The workflow contains nearly every failure mode that enterprise automation teams will recognize:
| Stage | What can go wrong | Stabilizing mechanism |
|---|---|---|
| Search and filtering | Irrelevant or stale inputs enter the pipeline | Specialized retrieval and filtering agents |
| Scraping | Raw HTML or messy web content contaminates downstream generation | Markdown conversion before generation |
| Script drafting | Models emphasize different facts or styles | Multi-model consortium |
| Consolidation | Drafts conflict or drift beyond the source material | Reasoning agent with cross-model comparison |
| Video prompt generation | Natural language mixes with required JSON | Dedicated JSON builder agent |
| Media/API execution | Agents hallucinate file paths or status messages | Non-agent execution functions |
| Publishing | MCP/tool ambiguity breaks the final operational step | Direct function invocation for GitHub PR creation |
The paper’s strongest contribution is not that this specific podcast system exists. It is that the authors use the system to show where agentic discretion should be narrowed.
MCP is useful as an interface, not as a place to hide workflow logic
A casual reader might expect the paper to be pro-MCP in the usual breathless way: standardize everything, connect everything, let agents discover tools everywhere, and enjoy the future until the stack trace arrives.
The paper is more pragmatic.
The authors initially used a GitHub MCP server to create pull requests for generated podcast artifacts. During evaluation, they observed recurring issues: ambiguous tool selection, inconsistent parameter inference, and non-deterministic MCP responses. Their diagnosis is important. The failure was not that MCP was “bad.” The problem was that the agent had to reason through tool metadata and choose the correct invocation pathway for an infrastructure action that did not actually require language reasoning.
The fix was not another elaborate prompt. They replaced that step with direct pull-request creation logic.
Then they went further. Even a dedicated PR agent using a structured tool still required the model to reason about parameters and produce a tool call. The authors eventually removed the PR agent and invoked create_github_pr directly from the workflow controller. That is the paper’s philosophy in miniature: when the desired behavior is deterministic, do not wrap it in probabilistic interpretation just because the architecture diagram looks more agentic.
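As a minimal sketch of what direct invocation from a workflow controller could look like: the paper names `create_github_pr`, but the signature, the `publish_episode` controller step, the repository coordinates, and the injected `send` transport below are all assumptions made for illustration. The URL path and the `title`/`head`/`base`/`body` fields follow GitHub's REST endpoint for creating a pull request.

```python
import json
from typing import Callable

# Hypothetical transport: takes (url, json_body) and returns the parsed JSON response.
# In production this would wrap an authenticated HTTP POST; it is injected here so
# the function can be exercised without network access.
Send = Callable[[str, str], dict]


def create_github_pr(owner: str, repo: str, head: str, base: str,
                     title: str, body: str, send: Send) -> str:
    """Open a pull request deterministically and return its web URL."""
    url = f"https://api.github.com/repos/{owner}/{repo}/pulls"
    payload = json.dumps({"title": title, "head": head, "base": base, "body": body})
    response = send(url, payload)
    return response["html_url"]


def publish_episode(episode_id: str, send: Send) -> str:
    """Controller step: every parameter is computed by code, none is inferred by a model."""
    return create_github_pr(
        owner="example-org",          # assumed repository coordinates
        repo="podcast-artifacts",     # assumed repository name
        head=f"episode/{episode_id}",
        base="main",
        title=f"Add podcast episode {episode_id}",
        body="Generated script, audio, and video artifacts.",
        send=send,
    )
```

Nothing in this path asks a model to pick a tool or fill in a parameter, so a failure here is an ordinary software failure with an ordinary stack trace.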
But the paper does not reject MCP. In the implementation section, the authors expose the workflow backend through a REST API and build a corresponding MCP server so MCP-enabled clients can invoke the workflow. The difference is architectural placement. MCP becomes an adapter layer, not the owner of business logic.
That distinction matters for business systems.
A finance team building an AI reporting workflow, a compliance team automating evidence collection, or a legal operations team generating document summaries may still want MCP-style interoperability. But they should not bury operational logic inside the MCP server or ask an LLM to decide how to perform deterministic back-office actions. The workflow engine should remain testable, observable, and version-controlled. The MCP server should translate access, not absorb responsibility.
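The "adapter, not owner" placement can be sketched without any particular MCP library. In the hedged example below, `run_podcast_workflow` stands in for the REST-backed workflow and `handle_tool_call` stands in for the adapter layer; both names, the `generate_podcast` tool name, and the request fields are hypothetical and not taken from the paper's repositories.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkflowRequest:
    topics: List[str]
    source_urls: List[str]


def run_podcast_workflow(request: WorkflowRequest) -> dict:
    """Stand-in for the REST-backed workflow. Business logic belongs on this side."""
    return {
        "status": "accepted",
        "topics": request.topics,
        "source_count": len(request.source_urls),
    }


def handle_tool_call(tool_name: str, arguments: dict) -> dict:
    """Adapter layer: validate, translate, delegate. No business rules live here."""
    if tool_name != "generate_podcast":  # hypothetical tool name
        return {"error": f"unknown tool: {tool_name}"}
    try:
        request = WorkflowRequest(
            topics=list(arguments["topics"]),
            source_urls=list(arguments["source_urls"]),
        )
    except KeyError as missing:
        return {"error": f"missing argument: {missing.args[0]}"}
    return run_podcast_workflow(request)
```

Because the adapter only translates, the workflow can be tested, versioned, and observed through its own interface, and the adapter can be replaced without touching it.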
This is the boring answer. It is also the one that survives contact with production.
“One agent, one job” is not aesthetic minimalism; it is failure containment
The paper’s agent design recommendations may sound like ordinary software-engineering hygiene: avoid overloading agents, use single-responsibility components, keep workflows simple. That familiarity is exactly the point. Agentic AI is not exempt from software discipline because the components can write fluent English.
The authors give a concrete example. An early workflow design used one agent with two tools: scrape_markdown and publish_markdown. The idea was straightforward: scrape content, then publish the extracted Markdown for audit purposes. In practice, the agent sometimes called only one tool, called the tools in the wrong order, or called neither tool, especially as prompt or input size increased.
The repair was simple: split the design into two independent agents, each responsible for one tool.
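One way to picture the split, as a sketch under assumptions rather than the authors' code: each step below would wrap a single-tool agent (one around `scrape_markdown`, one around `publish_markdown`), and the controller function `scrape_then_publish`, which is a hypothetical name, fixes the order in code so that no model has to decide it.

```python
from typing import Callable

# Each step is assumed to wrap exactly one single-tool agent and to return text.
Step = Callable[[str], str]


def scrape_then_publish(url: str, scrape_step: Step, publish_step: Step) -> str:
    """Run the two single-responsibility steps in a fixed order."""
    markdown = scrape_step(url)
    if not markdown.strip():
        # Stop early instead of letting an empty scrape reach the publish step.
        raise ValueError(f"scrape step returned no Markdown for {url}")
    return publish_step(markdown)
```

The sequencing and the empty-output check are now ordinary control flow, which means a skipped or reordered call is no longer a possible model behavior.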
The same pattern appears in the video-generation section. An early design asked one agent to transform the final script into a Veo-3 JSON specification and also generate the corresponding video. That mixed two very different responsibilities: planning a structured video prompt and executing an external media-generation process. The model sometimes produced malformed JSON, mixed prose with JSON, or hallucinated file paths and status messages.
The fix was again decomposition. A Veo JSON Builder Agent produces valid JSON. A separate non-agent function receives that JSON, calls the video API, handles retries, saves the file, and reports status.
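The planning/execution split can be sketched as two plain functions on the non-agent side: one that refuses anything other than a clean JSON object, and one that owns the API call and retries. The required keys below are a made-up placeholder schema, not the real Veo-3 format, and `call_video_api` is a hypothetical injected client.

```python
import json
import time
from typing import Callable

# Placeholder schema for illustration only; the real Veo-3 JSON format is not assumed here.
REQUIRED_KEYS = {"prompt", "duration_seconds"}


def parse_video_spec(agent_output: str) -> dict:
    """Accept only a single JSON object containing the required keys."""
    # json.loads raises json.JSONDecodeError (a ValueError) if prose is mixed in.
    spec = json.loads(agent_output)
    if not isinstance(spec, dict):
        raise ValueError("video spec must be a JSON object")
    missing = REQUIRED_KEYS - spec.keys()
    if missing:
        raise ValueError(f"video spec is missing keys: {sorted(missing)}")
    return spec


def generate_video(spec: dict, call_video_api: Callable[[dict], bytes],
                   attempts: int = 3, delay: float = 0.0) -> bytes:
    """Non-agent execution: call the API, retry on connection errors, report failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return call_video_api(spec)
        except ConnectionError as error:
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay)
    raise RuntimeError(f"video API failed after {attempts} attempts") from last_error
```

With this shape, a malformed agent output fails loudly at the parsing boundary, and file paths or status messages come from code that actually performed the work.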
The operational lesson is sharper than “modularize your agents.” It is this:
Each extra responsibility given to an LLM expands the space of possible failure modes.
A traditional function with two responsibilities may be ugly. An LLM agent with two responsibilities may be unstable. The difference is that the LLM must infer intent, maintain context, select output format, and decide when to call tools. Every additional duty increases cognitive load inside a component that is already probabilistic.
That is why “one agent, one job” is not just clean architecture. It is a risk-control mechanism.
Prompt governance belongs outside the codebase’s nervous system
The paper also recommends storing prompts externally and loading them at runtime. This may look like a small implementation detail. It is not.
In agentic systems, prompts are not comments. They are executable policy. They define role boundaries, output formats, safety rules, escalation behavior, source-grounding expectations, and sometimes the difference between a stable workflow and a polite disaster.
Embedding prompts directly inside source code creates two problems. First, prompt changes become code changes, which slows iteration and makes business review awkward. Second, prompt history becomes harder to govern. In regulated or client-facing workflows, it matters which instruction version produced which output.
The authors store prompts in a dedicated GitHub repository and load them dynamically. That enables review, version pinning, rollback, and controlled access. It also lets non-engineering stakeholders participate in prompt refinement without editing workflow code.
For business teams, this is the bridge between “AI experimentation” and “AI operations.” A prompt repository can support:
| Prompt operation | Business reason |
|---|---|
| Versioning | Know which instruction set generated an output |
| Review workflow | Let domain owners approve behavior changes |
| Rollback | Recover quickly when a prompt revision causes drift |
| A/B testing | Compare prompt variants without redeploying core code |
| Red-team updates | Add safety constraints as failure modes are discovered |
| Access control | Separate prompt authors, reviewers, and deployers |
This is not glamorous. It is also the sort of thing that prevents a client-facing AI system from silently changing behavior on a Tuesday because someone “improved the prompt.”
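A minimal sketch of runtime prompt loading with version pinning, assuming the prompt repository is available as a local checkout with one `.md` file per prompt. The directory layout, the `load_prompt` helper, and the use of a SHA-256 digest as the version identifier are illustrative assumptions, not details from the paper.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class LoadedPrompt:
    name: str
    text: str
    sha256: str  # content digest, suitable for logging next to each output


def load_prompt(prompt_dir: Path, name: str,
                expected_sha256: Optional[str] = None) -> LoadedPrompt:
    """Read a prompt from outside the codebase and optionally enforce a pinned version."""
    text = (prompt_dir / f"{name}.md").read_text(encoding="utf-8")
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if expected_sha256 is not None and digest != expected_sha256:
        raise ValueError(
            f"prompt '{name}' has digest {digest}, expected {expected_sha256}"
        )
    return LoadedPrompt(name=name, text=text, sha256=digest)
```

Logging the digest with every generated artifact answers the governance question of which instruction version produced which output, and a pinned digest turns a silent prompt change into an explicit failure.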
Multi-model reasoning is useful, but not magic truth serum
The paper’s Responsible-AI mechanism is a model consortium followed by a reasoning agent. In the podcast workflow, different LLMs independently draft scripts from the same scraped content. The reasoning agent then compares those outputs, identifies shared facts, handles conflicts, removes speculative claims, and produces a consolidated script.
This is a sensible pattern. It creates diversity at the draft stage and compression at the final stage. The system harvests variation where variation is useful, then narrows the output before publication.
But the evidence should be read carefully. The evaluation section is mainly qualitative and implementation-oriented. It shows representative outputs from the script generation agents, the reasoning agent’s consolidation prompt, a consolidated script, video-script output, and Veo-3 JSON output. The authors report that the Veo-3 JSON builder consistently produced well-formed JSON across multiple runs, but the paper does not present a broad statistical benchmark, cross-domain evaluation, or quantified reliability comparison.
That does not make the evaluation useless. It tells us what kind of evidence it is.
| Paper element | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Podcast-generation case study | Main implementation evidence | The proposed practices can be assembled into a working end-to-end workflow | That the architecture is optimal across industries |
| MCP-to-direct-function examples | Design comparison inside the case | Removing LLM interpretation from deterministic steps can reduce observed instability | A universal rejection of MCP |
| Multi-model script outputs | Exploratory demonstration of model diversity | Different models produce different emphasis and style from the same source material | That consensus always improves factual accuracy |
| Reasoning-agent consolidation | Main workflow mechanism | A dedicated consolidator can reconcile drafts and reduce speculation within this pipeline | That consolidated output is automatically true |
| Veo-3 JSON builder examples | Implementation-detail validation | Narrow agents with strict output contracts can produce usable structured artifacts | That all schema-generation tasks will be robust without further testing |
| Docker/Kubernetes deployment | Operational implementation evidence | The workflow can be packaged and deployed in cloud-native form | That containerization alone ensures AI reliability |
The useful business inference is not “use three models and you are safe.” The better inference is: when outputs are subjective or generative, use structured comparison and consolidation before taking action. Consensus is not truth. But it is often a better control surface than a single model’s unchecked confidence.
For high-stakes settings, the consolidator should still be paired with source verification, logging, human review thresholds, and domain-specific evaluation. The paper’s architecture points in that direction; it does not remove the need for those controls.
The real workflow pattern is variation, consolidation, determinism
The paper’s mechanism can be summarized as a three-zone pipeline:
| Variation zone | → Consolidation zone | → Deterministic zone |
|---|---|---|
| LLM drafts | Reasoning agent | Functions, APIs, storage |
| Multi-model outputs | Conflict resolution | GitHub PRs, files |
| Semantic judgment | Source-grounded synthesis | Deployment, observability |
This pattern is more useful than a checklist because it tells teams where to place intelligence.
In the variation zone, multiple models or agents can produce candidate outputs. This is valuable when the task benefits from language diversity: drafting, interpretation, summarization, explanation, scenario generation, and creative planning.
In the consolidation zone, a reasoning agent narrows those candidates. It compares, reconciles, filters, and structures. The prompt used for this role matters because the agent must not simply average the drafts. It must prefer shared facts, drop speculative or conflicting details, and acknowledge uncertainty rather than inventing closure.
In the deterministic zone, ordinary software takes over. File operations, API calls, publishing, schema validation, retries, observability, deployment, and access control should be boring. The fewer interpretive decisions here, the better.
Many failed agentic systems invert this pattern. They use one model to produce a final answer, then let agents improvise the operational steps afterward. That is backwards. Let models explore and reason while the cost of variation is still contained. Once the system reaches execution, narrow the pathway.
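The three zones can be sketched as a single controller function whose shape makes the narrowing explicit. Everything here is a hypothetical illustration: in the paper the drafters and the consolidator are LLM agents, whereas `shared_lines` below is only a toy stand-in that keeps lines present in every draft, included so the sketch runs without any model.

```python
from typing import Callable, List, Sequence

Drafter = Callable[[str], str]                 # variation zone: one per model
Consolidator = Callable[[Sequence[str]], str]  # consolidation zone
Publisher = Callable[[str], str]               # deterministic zone


def run_pipeline(source_markdown: str, drafters: Sequence[Drafter],
                 consolidate: Consolidator, publish: Publisher) -> str:
    # Variation zone: several independent drafts from the same grounding material.
    drafts: List[str] = [draft(source_markdown) for draft in drafters]
    # Consolidation zone: narrow many candidates down to one script.
    final_script = consolidate(drafts)
    if not final_script.strip():
        raise ValueError("consolidator returned an empty script")
    # Deterministic zone: plain code from here on, no model decisions.
    return publish(final_script)


def shared_lines(drafts: Sequence[str]) -> str:
    """Toy consolidator (NOT the paper's reasoning agent): keep lines every draft contains."""
    common = set(drafts[0].splitlines())
    for draft in drafts[1:]:
        common &= set(draft.splitlines())
    return "\n".join(line for line in drafts[0].splitlines() if line in common)
```

The point of the shape is that variation is only allowed before `consolidate` returns; after that, the pathway has exactly one branch.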
Deployment is not an afterthought; it is part of the AI design
The implementation section matters because it pulls the workflow out of notebook-land. The authors use the OpenAI Agents SDK, expose the workflow through a REST API, build a separate MCP server as an adapter, release workflow and MCP-server repositories, include Dockerfiles and Kubernetes deployment manifests, and test MCP interoperability through LM Studio.
This is not just DevOps decoration. Agentic workflows need deployment discipline because their failure modes are not limited to bad answers. They include retry storms, tool-call failures, API quota problems, inconsistent model behavior, prompt drift, dependency mismatch, missing secrets, unhealthy containers, and silent partial completion.
Containerization addresses part of that problem by creating reproducible runtime environments. Kubernetes adds scaling, restarts, health checks, isolation, and deployment patterns such as blue-green or canary releases. Observability tools can track workflow execution across agent stages.
Still, deployment infrastructure does not make a bad agent design good. It makes a disciplined design operable. That distinction matters. Kubernetes can restart a failed pod. It cannot decide whether your agent should have been allowed to both generate JSON and pretend it created a video file.
The business value is not “we used Kubernetes.” The value is that the AI workflow becomes inspectable and maintainable enough to enter the same operational universe as other enterprise software.
What businesses should take from the paper
The paper directly shows a full implementation of a multimodal news-to-podcast workflow built with specialized agents, external tools, multi-model generation, reasoning consolidation, REST/MCP separation, and containerized deployment. It also reports specific design failures observed during development, especially around MCP/tool ambiguity, overloaded agents, and mixed planning/execution responsibilities.
From that, Cognaptus would infer a practical operating model for enterprise agentic AI:
| Design choice | Direct paper basis | Business interpretation |
|---|---|---|
| Replace unnecessary agent calls with pure functions | GitHub PR step moved from MCP/tool/agent logic to direct function invocation | Reduce cost, latency, ambiguity, and debugging burden |
| Use single-responsibility agents | Scraping/publishing and Veo JSON/video generation examples | Make failures local and contracts testable |
| Externalize prompts | Prompts stored separately and loaded at runtime | Treat prompts as governed operational artifacts |
| Use multi-model generation plus reasoning consolidation | Llama/OpenAI/Gemini drafts consolidated by reasoning agent | Improve review surface for generative outputs |
| Separate MCP from workflow logic | REST workflow backend plus MCP adapter | Preserve interoperability without sacrificing maintainability |
| Containerize the workflow | Docker/Kubernetes deployment in the implementation | Move agent systems toward normal production operations |
The uncertain part is magnitude. The paper does not quantify cost savings, reliability improvement, latency reduction, or defect-rate changes across many workflows. A company should therefore treat the paper as an engineering blueprint, not a benchmark report.
That is still valuable. Many organizations do not fail at agentic AI because they lack a more powerful model. They fail because the workflow gives the model too many unbounded responsibilities and then calls the resulting instability “AI behavior.” Very elegant. Also very avoidable.
Where the paper should not be overread
There are three boundaries worth keeping clear.
First, the evidence is mostly a detailed case study and qualitative evaluation. The authors show how the architecture works and where design changes improved stability in their workflow, but they do not provide a broad cross-industry benchmark. The right conclusion is “this is a practical architecture worth adapting,” not “this proves universal superiority.”
Second, the Responsible-AI claims are architectural rather than final. A multi-model consortium and reasoning agent can reduce some risks, especially idiosyncratic model drift and unsupported speculation. But multiple models can still share blind spots, and a reasoning agent can still consolidate plausible falsehoods if the source material is weak. Consensus is a control mechanism, not a certificate of truth.
Third, MCP is not the enemy. The paper’s more precise lesson is that MCP should be placed carefully. Use it for interoperability and client access. Avoid forcing the LLM to reason through broad MCP tool metadata for deterministic infrastructure tasks. The adapter is useful. The adapter should not become the brain.
The assembly line is the point
The best agentic workflows will not look like one giant agent with a heroic prompt. They will look like disciplined production systems: narrow stations, explicit contracts, controlled handoffs, source-grounded reasoning, deterministic execution, versioned prompts, and observable deployment.
That may disappoint people who wanted agentic AI to mean “let the model figure everything out.” But in business systems, letting the model figure everything out is often just a premium subscription to randomness.
The paper’s real contribution is to move agentic AI from performance art toward engineering practice. It shows that production-grade agents are built not by maximizing autonomy everywhere, but by deciding where autonomy belongs.
The assembly line metaphor is useful because it is not romantic. Each station does its job. The uncertain work is isolated. The final steps are controlled. The output can be inspected. The system can be improved without praying over a monolithic prompt.
That is how agentic AI becomes infrastructure.
Not by making every component intelligent.
By making the whole workflow less foolish.
Cognaptus: Automate the Present, Incubate the Future.
[1] Eranga Bandara et al., “A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows”, arXiv:2512.08769, 2025.