Agents on the Assembly Line: How Production-Grade AI Workflows Actually Get Built

Assembly lines are not exciting because every worker improvises.

They are useful because each station does a narrow job, hands the result forward, and leaves as little room as possible for charming chaos. That is also the quiet lesson in “A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows.”¹ The paper looks, at first glance, like another guide to agents, tools, MCP servers, multi-model reasoning, and cloud-native deployment. The tempting summary would be: “Here are nine best practices for building agentic AI.”

That summary is accurate, but too flat.

The more interesting argument is mechanical: production-grade agentic AI improves when the system deliberately reduces the number of moments where the LLM must guess. The authors do not argue for agentic complexity as a virtue. They argue for a controlled division of labor: keep language judgment where language judgment creates value, and move infrastructure behavior into ordinary deterministic software.

Yes, the paper discusses agents. Yes, it uses MCP. Yes, it demonstrates a multi-agent multimodal workflow. But the most practical message is almost anti-theatrical: if the task is committing files, creating pull requests, generating timestamps, calling APIs, storing prompts, routing requests, or deploying services, do not ask the model to perform interpretive dance. Use code.

The production problem is not intelligence; it is uncontrolled discretion

Most agent demos succeed by collapsing several jobs into one impressive chain. The model searches, reads, reasons, writes, calls tools, fixes errors, and publishes output. In a demo, this feels magical. In production, it becomes a debugging swamp with a logo.

The paper frames the problem through a familiar enterprise transition: prototypes are easy, but production systems require reliability, observability, maintainability, governance, and reproducibility. In agentic systems, those goals are harder because LLMs add probabilistic behavior inside workflows that otherwise need operational discipline.

The authors’ practical response is not “build a smarter agent.” It is closer to “remove unnecessary choices from the agent.”

That gives us the core mechanism:

Workflow decision	Bad production instinct	Better mechanism in the paper
External system access	Let the agent interpret broad MCP/tool metadata	Use direct tool calls or pure functions when possible
Infrastructure actions	Ask the LLM to create commits, files, pull requests, or API writes	Put these steps in deterministic orchestration code
Agent scope	Give one agent many tools and responsibilities	Use single-tool and single-responsibility agents
Prompt management	Hide prompts inside source code	Store prompts externally and load them at runtime
Model reliability	Trust one model’s answer	Use multiple model drafts and a reasoning consolidator
Workflow exposure	Embed workflow logic inside the MCP server	Keep MCP as a thin adapter over a REST-backed workflow
Deployment	Run scripts manually or semi-manually	Containerize and deploy through Docker/Kubernetes

The principle underneath the table is simple: uncertainty should be budgeted. Spend it where semantic interpretation is required. Do not spend it on plumbing.

The case study is a media pipeline, but the mechanism is enterprise automation

The paper demonstrates its guidance through a podcast-generation workflow. The domain sounds lightweight: collect news, generate a podcast script, produce audio and video artifacts, and publish the result. But architecturally, this is a useful stress test because it combines retrieval, filtering, scraping, multi-model generation, reasoning consolidation, multimodal output, and software publishing.

The workflow starts with user-provided topics and source URLs. A search agent collects recent updates, a filtering agent selects relevant articles, and a scraping agent converts selected pages into clean Markdown. That Markdown becomes the grounding material for the generation stage.

Then the system calls multiple podcast script generation agents, each powered by a different LLM family. The paper’s evaluation section describes a consortium composed of Llama, OpenAI, and Gemini agents. The authors observe that these models produce different styles: Llama tends to be concise and structured, OpenAI more detailed and narrative-driven, and Gemini more stylistically flowing and contextual. That diversity is useful, but it also introduces emphasis drift and factual variation.

So the workflow adds a reasoning agent. Its job is not to create from nothing. It compares the drafts, resolves inconsistencies, removes speculative claims, and produces a consolidated final script. After that, other components convert the script into audio/video instructions, including Veo-3 JSON, text-to-speech output, video generation, and GitHub publishing.

This is where the case study becomes more than “AI makes a podcast.” The workflow contains nearly every failure mode that enterprise automation teams will recognize:

Stage	What can go wrong	Stabilizing mechanism
Search and filtering	Irrelevant or stale inputs enter the pipeline	Specialized retrieval and filtering agents
Scraping	Raw HTML or messy web content contaminates downstream generation	Markdown conversion before generation
Script drafting	Models emphasize different facts or styles	Multi-model consortium
Consolidation	Drafts conflict or drift beyond the source material	Reasoning agent with cross-model comparison
Video prompt generation	Natural language mixes with required JSON	Dedicated JSON builder agent
Media/API execution	Agents hallucinate file paths or status messages	Non-agent execution functions
Publishing	MCP/tool ambiguity breaks the final operational step	Direct function invocation for GitHub PR creation

The paper’s strongest contribution is not that this specific podcast system exists. It is that the authors use the system to show where agentic discretion should be narrowed.

MCP is useful as an interface, not as a place to hide workflow logic

A casual reader might expect the paper to be pro-MCP in the usual breathless way: standardize everything, connect everything, let agents discover tools everywhere, and enjoy the future until the stack trace arrives.

The paper is more pragmatic.

The authors initially used a GitHub MCP server to create pull requests for generated podcast artifacts. During evaluation, they observed recurring issues: ambiguous tool selection, inconsistent parameter inference, and non-deterministic MCP responses. Their diagnosis is important. The failure was not that MCP was “bad.” The problem was that the agent had to reason through tool metadata and choose the correct invocation pathway for an infrastructure action that did not actually require language reasoning.

The fix was not another elaborate prompt. They replaced that step with direct pull-request creation logic.

Then they went further. Even a dedicated PR agent using a structured tool still required the model to reason about parameters and produce a tool call. The authors eventually removed the PR agent and invoked create_github_pr directly from the workflow controller. That is the paper’s philosophy in miniature: when the desired behavior is deterministic, do not wrap it in probabilistic interpretation just because the architecture diagram looks more agentic.
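
A minimal sketch of what that final step can look like, under stated assumptions. The name create_github_pr comes from the paper; its signature here, the PullRequest type, the branch and path layout, and the injected submit callable are hypothetical illustrations, not the authors' code and not a real GitHub client.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class PullRequest:
    """Fully specified PR payload; nothing here is inferred by a model."""
    branch: str
    title: str
    files: Dict[str, str]  # path -> file content


def create_github_pr(pr: PullRequest, submit: Callable[[PullRequest], str]) -> str:
    """Validate the payload in code, then hand it to the transport.

    `submit` stands in for whatever GitHub client a team uses (assumption).
    """
    if not pr.branch or not pr.title:
        raise ValueError("branch and title are required")
    if not pr.files:
        raise ValueError("refusing to open an empty pull request")
    return submit(pr)


def publish_episode(script: str, run_id: str,
                    submit: Callable[[PullRequest], str]) -> str:
    """Controller step: every parameter is computed, none is generated by an LLM."""
    pr = PullRequest(
        branch=f"podcast/{run_id}",  # hypothetical naming scheme
        title=f"Add podcast artifacts for run {run_id}",
        files={f"episodes/{run_id}/script.md": script},
    )
    return create_github_pr(pr, submit)
```

The model's contribution ends at the script text. Branch names, titles, and paths are derived in code, so the same inputs always produce the same pull request.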

But the paper does not reject MCP. In the implementation section, the authors expose the workflow backend through a REST API and build a corresponding MCP server so MCP-enabled clients can invoke the workflow. The difference is architectural placement. MCP becomes an adapter layer, not the owner of business logic.
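
The adapter idea can be sketched roughly as follows. Everything named here is an assumption for illustration: the tool name generate_podcast, the placeholder URL, the argument names, the result dictionary shape, and the injected send transport. It is not the authors' MCP server and does not use a real MCP SDK.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical transport signature: (method, url, json_body) -> response dict.
Transport = Callable[[str, str, str], Dict[str, Any]]

WORKFLOW_URL = "http://workflow.internal/api/v1/podcasts"  # placeholder URL


def handle_tool_call(tool_name: str, arguments: Dict[str, Any],
                     send: Transport) -> Dict[str, Any]:
    """Map one tool call onto one REST request. No workflow logic lives here."""
    if tool_name != "generate_podcast":
        return {"error": f"unknown tool: {tool_name}", "result": None}
    missing = [key for key in ("topics", "source_urls") if key not in arguments]
    if missing:
        return {"error": f"missing arguments: {missing}", "result": None}
    body = json.dumps({
        "topics": arguments["topics"],
        "source_urls": arguments["source_urls"],
    })
    # The REST backend owns search, drafting, consolidation, and publishing.
    return {"error": None, "result": send("POST", WORKFLOW_URL, body)}
```

Because the adapter only validates and forwards, the workflow behind the REST endpoint can be tested and versioned without any MCP client in the loop.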

That distinction matters for business systems.

A finance team building an AI reporting workflow, a compliance team automating evidence collection, or a legal operations team generating document summaries may still want MCP-style interoperability. But they should not bury operational logic inside the MCP server or ask an LLM to decide how to perform deterministic back-office actions. The workflow engine should remain testable, observable, and version-controlled. The MCP server should translate access, not absorb responsibility.

This is the boring answer. It is also the one that survives contact with production.

“One agent, one job” is not aesthetic minimalism; it is failure containment

The paper’s agent design recommendations may sound like ordinary software-engineering hygiene: avoid overloading agents, use single-responsibility components, keep workflows simple. That familiarity is exactly the point. Agentic AI is not exempt from software discipline because the components can write fluent English.

The authors give a concrete example. An early workflow design used one agent with two tools: scrape_markdown and publish_markdown. The idea was straightforward: scrape content, then publish the extracted Markdown for audit purposes. In practice, the agent sometimes called only one tool, called the tools in the wrong order, or called neither tool at all, especially as prompt or input size increased.

The repair was simple: split the design into two independent agents, each responsible for one tool.
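
One way to make that rule hard to violate is to encode it, as in the sketch below. The tool names scrape_markdown and publish_markdown come from the paper; AgentSpec, the runner callables, and the empty-content check are hypothetical and do not reflect any particular agent SDK.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class AgentSpec:
    """Declarative agent definition that refuses more than one tool."""
    name: str
    tools: Tuple[str, ...]

    def __post_init__(self) -> None:
        if len(self.tools) != 1:
            raise ValueError(
                f"agent {self.name!r} must own exactly one tool, got {len(self.tools)}"
            )


SCRAPER = AgentSpec("scraper", ("scrape_markdown",))
PUBLISHER = AgentSpec("publisher", ("publish_markdown",))


def scrape_then_publish(url: str,
                        run_scraper: Callable[[str], str],
                        run_publisher: Callable[[str], str]) -> str:
    """The controller, not a model, fixes the order of the two stages."""
    markdown = run_scraper(url)
    if not markdown.strip():
        raise RuntimeError("scraper returned no content; nothing to publish")
    return run_publisher(markdown)
```

Ordering and "did both steps run" stop being questions the model can answer incorrectly, because the model is never asked.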

The same pattern appears in the video-generation section. An early design asked one agent to transform the final script into a Veo-3 JSON specification and also generate the corresponding video. That mixed two very different responsibilities: planning a structured video prompt and executing an external media-generation process. The model sometimes produced malformed JSON, mixed prose with JSON, or hallucinated file paths and status messages.

The fix was again decomposition. A Veo JSON Builder Agent produces valid JSON. A separate non-agent function receives that JSON, calls the video API, handles retries, saves the file, and reports status.
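
The split between the two halves might look like the sketch below. The required keys are a made-up stand-in and not the real Veo-3 format, call_api is a hypothetical injected client, and retrying only on ConnectionError is an assumption chosen for brevity.

```python
import json
import time
from typing import Any, Callable, Dict, Optional

REQUIRED_KEYS = ("prompt", "duration_seconds")  # hypothetical schema


def parse_video_spec(agent_output: str) -> Dict[str, Any]:
    """Accept only a bare JSON object; prose mixed with JSON is rejected."""
    try:
        spec = json.loads(agent_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"agent output is not pure JSON: {exc}") from exc
    if not isinstance(spec, dict):
        raise ValueError("video spec must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in spec]
    if missing:
        raise ValueError(f"video spec missing keys: {missing}")
    return spec


def generate_video(spec: Dict[str, Any],
                   call_api: Callable[[Dict[str, Any]], str],
                   attempts: int = 3,
                   delay: float = 0.0) -> str:
    """Non-agent execution with bounded retries.

    Returns whatever path the API client reports, so file paths and status
    come from the real call rather than from generated text.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return call_api(spec)
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay * attempt)  # simple linear backoff
    raise RuntimeError(f"video generation failed after {attempts} attempts") from last_error
```

The agent's contract shrinks to "emit one JSON object," which is easy to check, while retries and file handling sit in code that can be unit-tested without a model.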

The operational lesson is sharper than “modularize your agents.” It is this:

Each extra responsibility given to an LLM expands the space of possible failure modes.

A traditional function with two responsibilities may be ugly. An LLM agent with two responsibilities may be unstable. The difference is that the LLM must infer intent, maintain context, select output format, and decide when to call tools. Every additional duty increases cognitive load inside a component that is already probabilistic.

That is why “one agent, one job” is not just clean architecture. It is a risk-control mechanism.

Prompt governance belongs outside the codebase’s nervous system

The paper also recommends storing prompts externally and loading them at runtime. This may look like a small implementation detail. It is not.

In agentic systems, prompts are not comments. They are executable policy. They define role boundaries, output formats, safety rules, escalation behavior, source-grounding expectations, and sometimes the difference between a stable workflow and a polite disaster.

Embedding prompts directly inside source code creates two problems. First, prompt changes become code changes, which slows iteration and makes business review awkward. Second, prompt history becomes harder to govern. In regulated or client-facing workflows, it matters which instruction version produced which output.

The authors store prompts in a dedicated GitHub repository and load them dynamically. That enables review, version pinning, rollback, and controlled access. It also lets non-engineering stakeholders participate in prompt refinement without editing workflow code.
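
A rough sketch of runtime loading with version pinning follows. The paper keeps prompts in a GitHub repository; this example assumes a local checkout of such a repository, and the directory layout, the LoadedPrompt type, and the idea of logging a content hash are hypothetical choices rather than the authors' implementation.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class LoadedPrompt:
    name: str
    version: str
    text: str
    sha256: str  # log this next to every output the prompt produces


def load_prompt(root: Path, name: str, version: str) -> LoadedPrompt:
    """Read <root>/<name>/<version>.md (hypothetical layout) and fingerprint it."""
    path = root / name / f"{version}.md"
    if not path.is_file():
        raise FileNotFoundError(f"no prompt {name!r} at version {version!r}")
    text = path.read_text(encoding="utf-8")
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return LoadedPrompt(name=name, version=version, text=text, sha256=digest)
```

Pinning the version in workflow configuration makes rollback a one-line change, and the hash gives an audit trail even if someone edits a file without bumping its version.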

For business teams, this is the bridge between “AI experimentation” and “AI operations.” A prompt repository can support:

Prompt operation	Business reason
Versioning	Know which instruction set generated an output
Review workflow	Let domain owners approve behavior changes
Rollback	Recover quickly when a prompt revision causes drift
A/B testing	Compare prompt variants without redeploying core code
Red-team updates	Add safety constraints as failure modes are discovered
Access control	Separate prompt authors, reviewers, and deployers

This is not glamorous. It is also the sort of thing that prevents a client-facing AI system from silently changing behavior on a Tuesday because someone “improved the prompt.”

Multi-model reasoning is useful, but not magic truth serum

The paper’s Responsible-AI mechanism is a model consortium followed by a reasoning agent. In the podcast workflow, different LLMs independently draft scripts from the same scraped content. The reasoning agent then compares those outputs, identifies shared facts, handles conflicts, removes speculative claims, and produces a consolidated script.

This is a sensible pattern. It creates diversity at the draft stage and compression at the final stage. The system harvests variation where variation is useful, then narrows the output before publication.

But the evidence should be read carefully. The evaluation section is mainly qualitative and implementation-oriented. It shows representative outputs from the script generation agents, the reasoning agent’s consolidation prompt, a consolidated script, video-script output, and Veo-3 JSON output. The authors report that the Veo-3 JSON builder consistently produced well-formed JSON across multiple runs, but the paper does not present a broad statistical benchmark, cross-domain evaluation, or quantified reliability comparison.

That does not make the evaluation useless. It tells us what kind of evidence it is.

Paper element	Likely purpose	What it supports	What it does not prove
Podcast-generation case study	Main implementation evidence	The proposed practices can be assembled into a working end-to-end workflow	That the architecture is optimal across industries
MCP-to-direct-function examples	Design comparison inside the case	Removing LLM interpretation from deterministic steps can reduce observed instability	A universal rejection of MCP
Multi-model script outputs	Exploratory demonstration of model diversity	Different models produce different emphasis and style from the same source material	That consensus always improves factual accuracy
Reasoning-agent consolidation	Main workflow mechanism	A dedicated consolidator can reconcile drafts and reduce speculation within this pipeline	That consolidated output is automatically true
Veo-3 JSON builder examples	Implementation-detail validation	Narrow agents with strict output contracts can produce usable structured artifacts	That all schema-generation tasks will be robust without further testing
Docker/Kubernetes deployment	Operational implementation evidence	The workflow can be packaged and deployed in cloud-native form	That containerization alone ensures AI reliability

The useful business inference is not “use three models and you are safe.” The better inference is: when outputs are subjective or generative, use structured comparison and consolidation before taking action. Consensus is not truth. But it is often a better control surface than a single model’s unchecked confidence.

For high-stakes settings, the consolidator should still be paired with source verification, logging, human review thresholds, and domain-specific evaluation. The paper’s architecture points in that direction; it does not remove the need for those controls.

The real workflow pattern is variation, consolidation, determinism

The paper’s mechanism can be summarized as a three-zone pipeline:

[Variation Zone]         [Consolidation Zone]          [Deterministic Zone]
LLM drafts               Reasoning agent               Functions, APIs, storage,
multi-model outputs  ->  conflict resolution       ->  GitHub PRs, files,
semantic judgment        source-grounded synthesis     deployment, observability

This pattern is more useful than a checklist because it tells teams where to place intelligence.

In the variation zone, multiple models or agents can produce candidate outputs. This is valuable when the task benefits from language diversity: drafting, interpretation, summarization, explanation, scenario generation, and creative planning.

In the consolidation zone, a reasoning agent narrows those candidates. It compares, reconciles, filters, and structures. The prompt used for this role matters because the agent must not simply average the drafts. It must prefer shared facts, drop speculative or conflicting details, and acknowledge uncertainty rather than inventing closure.
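
To make the "prefer shared facts" policy concrete, here is a deliberately simplified sketch. It swaps the paper's LLM reasoning agent for a plain support-counting rule, and it assumes each draft has already been broken into normalized claim strings, which is a hard problem in its own right. The function name and the min_support threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def consolidate(drafts: Dict[str, Set[str]],
                min_support: int = 2) -> Dict[str, List[str]]:
    """Keep claims backed by at least `min_support` models; drop the rest.

    `drafts` maps a model name to the set of claims found in its draft.
    Using sets means each model counts at most once per claim.
    """
    support = Counter(claim for claims in drafts.values() for claim in claims)
    kept = sorted(claim for claim, n in support.items() if n >= min_support)
    dropped = sorted(claim for claim, n in support.items() if n < min_support)
    return {"kept": kept, "dropped": dropped}


drafts = {
    "llama": {"vendor released v2", "pricing unchanged"},
    "openai": {"vendor released v2", "ceo may resign"},
    "gemini": {"vendor released v2", "pricing unchanged"},
}
print(consolidate(drafts))
# {'kept': ['pricing unchanged', 'vendor released v2'], 'dropped': ['ceo may resign']}
```

A rule like this is not averaging: a claim made by a single model is dropped rather than blended in. Shared claims can still be wrong, so the kept list is a candidate set for source verification, not a verdict.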

In the deterministic zone, ordinary software takes over. File operations, API calls, publishing, schema validation, retries, observability, deployment, and access control should be boring. The fewer interpretive decisions here, the better.

Many failed agentic systems invert this pattern. They use one model to produce a final answer, then let agents improvise the operational steps afterward. That is backwards. Let models explore and reason while the cost of variation is still contained. Once the system reaches execution, narrow the pathway.

Deployment is not an afterthought; it is part of the AI design

The implementation section matters because it pulls the workflow out of notebook-land. The authors use the OpenAI Agents SDK, expose the workflow through a REST API, build a separate MCP server as an adapter, release workflow and MCP-server repositories, include Dockerfiles and Kubernetes deployment manifests, and test MCP interoperability through LM Studio.

This is not just DevOps decoration. Agentic workflows need deployment discipline because their failure modes are not limited to bad answers. They include retry storms, tool-call failures, API quota problems, inconsistent model behavior, prompt drift, dependency mismatch, missing secrets, unhealthy containers, and silent partial completion.

Containerization addresses part of that problem by creating reproducible runtime environments. Kubernetes adds scaling, restarts, health checks, isolation, and deployment patterns such as blue-green or canary releases. Observability tools can track workflow execution across agent stages.
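
Health checks only help if the service can say something useful about itself. The sketch below shows a readiness function that a probe endpoint could call; the secret names and the response shape are hypothetical. Kubernetes HTTP probes treat status codes from 200 to 399 as success, so returning 503 keeps traffic away from a pod that is missing configuration.

```python
import os
from typing import Dict, Iterable, Mapping, Tuple

REQUIRED_SECRETS = ("OPENAI_API_KEY", "GITHUB_TOKEN")  # hypothetical names


def readiness(env: Mapping[str, str] = os.environ,
              required: Iterable[str] = REQUIRED_SECRETS
              ) -> Tuple[int, Dict[str, object]]:
    """Return (http_status, body) for a readiness probe handler.

    Only the names of missing variables are reported, never their values.
    """
    missing = sorted(name for name in required if not env.get(name))
    if missing:
        return 503, {"status": "unready", "missing": missing}
    return 200, {"status": "ready", "missing": []}
```

Failing readiness on missing secrets turns one of the quieter failure modes, a workflow that starts and then dies at its first API call, into a visible deployment error.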

Still, deployment infrastructure does not make a bad agent design good. It makes a disciplined design operable. That distinction matters. Kubernetes can restart a failed pod. It cannot decide whether your agent should have been allowed to both generate JSON and pretend it created a video file.

The business value is not “we used Kubernetes.” The value is that the AI workflow becomes inspectable and maintainable enough to enter the same operational universe as other enterprise software.

What businesses should take from the paper

The paper directly shows a full implementation of a multimodal news-to-podcast workflow built with specialized agents, external tools, multi-model generation, reasoning consolidation, REST/MCP separation, and containerized deployment. It also reports specific design failures observed during development, especially around MCP/tool ambiguity, overloaded agents, and mixed planning/execution responsibilities.

From that, Cognaptus would infer a practical operating model for enterprise agentic AI:

Design choice	Direct paper basis	Business interpretation
Replace unnecessary agent calls with pure functions	GitHub PR step moved from MCP/tool/agent logic to direct function invocation	Reduce cost, latency, ambiguity, and debugging burden
Use single-responsibility agents	Scraping/publishing and Veo JSON/video generation examples	Make failures local and contracts testable
Externalize prompts	Prompts stored separately and loaded at runtime	Treat prompts as governed operational artifacts
Use multi-model generation plus reasoning consolidation	Llama/OpenAI/Gemini drafts consolidated by reasoning agent	Improve review surface for generative outputs
Separate MCP from workflow logic	REST workflow backend plus MCP adapter	Preserve interoperability without sacrificing maintainability
Containerize the workflow	Docker/Kubernetes deployment in the implementation	Move agent systems toward normal production operations

The uncertain part is magnitude. The paper does not quantify cost savings, reliability improvement, latency reduction, or defect-rate changes across many workflows. A company should therefore treat the paper as an engineering blueprint, not a benchmark report.

That is still valuable. Many organizations do not fail at agentic AI because they lack a more powerful model. They fail because the workflow gives the model too many unbounded responsibilities and then calls the resulting instability “AI behavior.” Very elegant. Also very avoidable.

Where the paper should not be overread

There are three boundaries worth keeping clear.

First, the evidence is mostly a detailed case study and qualitative evaluation. The authors show how the architecture works and where design changes improved stability in their workflow, but they do not provide a broad cross-industry benchmark. The right conclusion is “this is a practical architecture worth adapting,” not “this proves universal superiority.”

Second, the Responsible-AI claims are architectural rather than final. A multi-model consortium and reasoning agent can reduce some risks, especially idiosyncratic model drift and unsupported speculation. But multiple models can still share blind spots, and a reasoning agent can still consolidate plausible falsehoods if the source material is weak. Consensus is a control mechanism, not a certificate of truth.

Third, MCP is not the enemy. The paper’s more precise lesson is that MCP should be placed carefully. Use it for interoperability and client access. Avoid forcing the LLM to reason through broad MCP tool metadata for deterministic infrastructure tasks. The adapter is useful. The adapter should not become the brain.

The assembly line is the point

The best agentic workflows will not look like one giant agent with a heroic prompt. They will look like disciplined production systems: narrow stations, explicit contracts, controlled handoffs, source-grounded reasoning, deterministic execution, versioned prompts, and observable deployment.

That may disappoint people who wanted agentic AI to mean “let the model figure everything out.” But in business systems, letting the model figure everything out is often just a premium subscription to randomness.

The paper’s real contribution is to move agentic AI from performance art toward engineering practice. It shows that production-grade agents are built not by maximizing autonomy everywhere, but by deciding where autonomy belongs.

The assembly line metaphor is useful because it is not romantic. Each station does its job. The uncertain work is isolated. The final steps are controlled. The output can be inspected. The system can be improved without praying over a monolithic prompt.

That is how agentic AI becomes infrastructure.

Not by making every component intelligent.

By making the whole workflow less foolish.

Cognaptus: Automate the Present, Incubate the Future.

¹ Eranga Bandara et al., “A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows,” arXiv:2512.08769, 2025.
