The demo is easy. The DAG is not.

Pipeline automation has a wonderfully deceptive user story.

A business analyst writes: “Take this customer file, clean the locations, geocode the addresses, add weather data, then save the enriched output.” An LLM replies with a Python file. The file looks plausible. There are imports. There is an Airflow DAG. There are operators. There are dependencies. A demo audience nods approvingly.

Then Airflow tries to load it.

That is where the theatre ends. The generated script may have a missing Docker image, a malformed dependency chain, a task that references a non-existent upstream node, an environment variable that exists only in the model’s imagination, or just half a file because the model wandered off before finishing. The code was fluent. The workflow was not.

That gap is the point of Prompt2DAG, a paper that studies how to turn natural-language descriptions into executable Apache Airflow DAGs for data enrichment pipelines [1]. Its useful lesson is not “LLMs can write pipeline code.” We have seen enough of that particular magic trick. The useful lesson is more operational: reliable workflow generation depends less on asking a smarter model for a better heroic answer, and more on forcing the problem through structured representations, templates, and validation gates.

In other words: prompt less like a poet, compile more like an engineer. Tedious? Slightly. Effective? Annoyingly, yes.

The real comparison is between four pipeline factories

Prompt2DAG evaluates four ways of moving from natural language to Airflow code. The comparison matters because each method embodies a different theory of how much structure an LLM needs.

| Method | How it works | What it assumes | Operational personality |
|---|---|---|---|
| Direct | Raw natural-language description goes straight into a one-shot request for a full Airflow DAG. | The model can infer requirements, structure, dependencies, and platform details in one leap. | Fast, seductive, and brittle. |
| LLM-only modular | The system first analyzes the pipeline into structured artifacts, then uses LLM generation for the final DAG. | Decomposition improves reasoning, but the model can still synthesize the executable code reliably. | Better organised, still exposed at the code-generation edge. |
| Hybrid | Structured analysis and workflow specification are combined with template-guided LLM code generation. | Templates should carry the stable architecture; the model should fill context-specific details. | The pragmatic middle child. Less glamorous, more useful. |
| Templated | Deterministic template expansion generates the final DAG from the structured workflow specification. | The workflow pattern is known well enough to encode in templates. | Most reliable, least flexible. Excellent when the world behaves. Which it rarely does. |

This is why a comparison-based reading of the paper is more useful than a normal summary. The business decision is not “Should we use an LLM?” It is “Where should probabilistic generation be allowed to operate, and where should deterministic scaffolding take over?”

Prompt2DAG’s answer is clear: let the LLM interpret intent and handle variation, but do not let it improvise the skeleton of an executable workflow from scratch unless you enjoy debugging as a lifestyle.

Prompt2DAG behaves more like a compiler than a chatbot

The methodology has four stages.

First, the natural-language description is converted into a structured JSON analysis. This analysis captures the execution environment, components, parameters, data flow, external integrations, and parallel patterns. The paper’s appendix makes this concrete: separate prompts are used for environment inference, component identification, flow structure, parameter extraction, integration analysis, and report generation. This is not a single “write me a DAG” prompt wearing a fake moustache. It is decomposition.
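To make the first stage concrete, here is a minimal sketch of what a typed analysis artifact could look like. The class and field names (`Component`, `PipelineAnalysis`, `environment`, `integrations`, and so on) follow the categories listed above but are hypothetical; the paper's actual JSON schema may well differ.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Component:
    """One pipeline step identified in the description (hypothetical shape)."""
    name: str
    kind: str  # e.g. "loader", "enricher", "exporter"
    upstream: list[str] = field(default_factory=list)
    parameters: dict[str, str] = field(default_factory=dict)


@dataclass
class PipelineAnalysis:
    """Structured analysis of a natural-language pipeline description."""
    environment: dict[str, str]
    components: list[Component]
    integrations: list[str] = field(default_factory=list)
    parallel_patterns: list[str] = field(default_factory=list)


def parse_analysis(raw_json: str) -> PipelineAnalysis:
    """Parse model output strictly: unknown or missing keys raise an error."""
    data = json.loads(raw_json)
    components = [Component(**c) for c in data.pop("components")]
    return PipelineAnalysis(components=components, **data)


# Illustrative model output; every value here is made up for the example.
example = """
{
  "environment": {"executor": "docker", "network": "pipeline_net"},
  "components": [
    {"name": "load_customers", "kind": "loader"},
    {"name": "geocode", "kind": "enricher", "upstream": ["load_customers"],
     "parameters": {"address_column": "address"}}
  ],
  "integrations": ["geocoding_api"]
}
"""
analysis = parse_analysis(example)
```

The design choice worth noting is strictness: a misspelled or invented key fails at parse time, which is a much cheaper place to fail than inside the scheduler.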

Second, that JSON artifact is transformed into a platform-neutral YAML workflow specification. This matters because the YAML is not merely prettier JSON. It is an intermediate representation: more readable, easier to review, and closer to execution semantics. It separates “what the pipeline means” from “how Airflow happens to express it.”

Third, executable Airflow code is generated. Here the methods diverge. The LLM-only path asks the model to synthesize the DAG. The Hybrid path uses templates as structural scaffolding while retaining LLM flexibility for task-specific code and configuration. The Templated path removes LLM code synthesis almost entirely.
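The Templated end of that spectrum can be sketched with nothing more than the standard library. This is an illustration of deterministic template expansion, not the paper's templates: the `spec` layout, image names, and `render_dag` helper are hypothetical, and the generated code assumes Airflow 2.4 or later (where `DAG` accepts `schedule`) with the Docker provider installed.

```python
from string import Template

# Hypothetical platform-neutral spec; in the paper this role is played by the YAML stage.
spec = {
    "dag_id": "customer_enrichment",
    "network": "pipeline_net",
    "tasks": [
        {"id": "load_customers", "image": "example/loader:1.0", "upstream": []},
        {"id": "geocode", "image": "example/geocoder:1.0", "upstream": ["load_customers"]},
        {"id": "save_output", "image": "example/exporter:1.0", "upstream": ["geocode"]},
    ],
}

HEADER = Template('''\
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(dag_id=$dag_id, start_date=datetime(2024, 1, 1), schedule=None, catchup=False) as dag:
''')

TASK = Template(
    "    $var = DockerOperator(task_id=$task_id, image=$image, network_mode=$network)\n"
)

EDGE = Template("    $up >> $down\n")


def render_dag(spec: dict) -> str:
    """Expand the templates deterministically; no model call is involved."""
    known = {task["id"] for task in spec["tasks"]}
    parts = [HEADER.substitute(dag_id=repr(spec["dag_id"]))]
    for task in spec["tasks"]:
        if not task["id"].isidentifier():
            raise ValueError(f"task id is not a valid Python name: {task['id']!r}")
        parts.append(TASK.substitute(
            var=task["id"],
            task_id=repr(task["id"]),
            image=repr(task["image"]),
            network=repr(spec["network"]),
        ))
    for task in spec["tasks"]:
        for up in task["upstream"]:
            if up not in known:
                raise ValueError(f"{task['id']} depends on unknown task {up!r}")
            parts.append(EDGE.substitute(up=up, down=task["id"]))
    return "".join(parts)


source = render_dag(spec)
# Syntax check only: compiling does not import Airflow or prove the DAG loads.
compile(source, "<generated_dag>", "exec")
```

In a Hybrid setup, one would expect the header and dependency wiring to stay templated like this while the model fills in task-specific arguments. That split is an interpretation of the description above, not a reproduction of the paper's implementation.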

Fourth, the generated DAG is evaluated automatically. The evaluation is unusually important because the paper does not merely admire generated code. It punishes code that cannot load. Non-loadable DAGs receive zero scores across the main metrics, which is exactly the kind of cruelty production systems tend to appreciate.

The evaluation rewards usable DAGs, not pretty failures

The study uses five data-enrichment scenarios and thirteen LLMs, producing 260 generation attempts across the four methods (5 × 13 × 4, or 65 attempts per method). The scenarios are not random toy examples. They cover common enrichment patterns: digital marketing pipelines using geocoding and weather APIs, supplier validation using knowledge-base reconciliation, and multilingual product review analysis using language detection and sentiment or feature extraction. The Digital Marketing variants also test sequential processing, pipeline-level parallelism, and task-level concurrency.

The target execution model is specific: Apache Airflow DAGs, Docker-based operators, shared volumes, managed Docker networks, and pre-built containerized components. That boundary matters. This is not a universal proof about all workflow automation. It is a controlled study of natural-language-to-Airflow generation for tabular enrichment pipelines.

The main metrics are:

  • SAT, a static code analysis score covering style, security, complexity, and general Python quality.
  • DST, a DAG structure and configuration score covering acyclicity, connectedness, operator configuration, and dependency correctness.
  • PCT, a platform conformance score covering Airflow loadability and task dry-run behaviour.
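The structural part of DST is the easiest of the three to picture in code. The sketch below checks three of the properties named above (dependency correctness, acyclicity, connectedness) on a plain mapping of task ids to upstream ids. The `check_structure` function and its messages are hypothetical, and this is not the paper's scoring code; operator configuration is not covered.

```python
from collections import deque


def check_structure(tasks: dict[str, list[str]]) -> list[str]:
    """Return structural problems; an empty list means the graph passed.

    `tasks` maps each task id to the ids of its upstream tasks.
    """
    # 1. Dependency correctness: every upstream reference must exist.
    problems = [
        f"{task}: unknown upstream '{up}'"
        for task, ups in tasks.items()
        for up in ups
        if up not in tasks
    ]
    if problems:
        return problems  # later checks assume all references resolve

    downstream: dict[str, list[str]] = {task: [] for task in tasks}
    for task, ups in tasks.items():
        for up in ups:
            downstream[up].append(task)

    # 2. Acyclicity via Kahn's algorithm: repeatedly remove tasks with no
    #    remaining upstreams; anything left over sits on a cycle.
    indegree = {task: len(ups) for task, ups in tasks.items()}
    ready = deque(task for task, degree in indegree.items() if degree == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for nxt in downstream[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if visited != len(tasks):
        problems.append("cycle detected")

    # 3. Connectedness: ignoring edge direction, all tasks form one component.
    if tasks:
        neighbours = {t: set(ups) | set(downstream[t]) for t, ups in tasks.items()}
        start = next(iter(tasks))
        seen, stack = {start}, [start]
        while stack:
            for other in neighbours[stack.pop()]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        if len(seen) != len(tasks):
            problems.append("graph is not connected")
    return problems
```

Running this on the intermediate specification, before any Airflow code exists, catches the "task that references a non-existent upstream node" class of failure without starting a scheduler.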

The hard gate is loadability. If Airflow’s DagBag cannot import the DAG, the artifact is operationally dead. The paper’s scoring reflects that reality by assigning zero downstream scores to non-loadable outputs.
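A minimal sketch of that gate, assuming Airflow 2.x: `DagBag` and its `import_errors` attribute are real Airflow APIs, while the `Scores` container and `penalized` helper are hypothetical names used only to illustrate the zero-score rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    sat: float
    dst: float
    pct: float


def dag_is_loadable(dag_folder: str) -> bool:
    """True if Airflow imports the folder cleanly and finds at least one DAG.

    Requires Airflow; the import is deferred so the rest of this sketch
    runs without it.
    """
    from airflow.models import DagBag  # Airflow 2.x import path

    bag = DagBag(dag_folder=dag_folder, include_examples=False)
    return not bag.import_errors and len(bag.dags) > 0


def penalized(scores: Scores, loadable: bool) -> Scores:
    """Apply the hard gate: a DAG that cannot load scores zero everywhere."""
    return scores if loadable else Scores(0.0, 0.0, 0.0)
```

For example, `penalized(Scores(8.1, 9.0, 9.2), loadable=False)` returns all zeros, however tidy the file looked to a linter.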

That design choice is not a detail. It is the difference between measuring “how nice the code looks when it works” and measuring “how often this method gives operations something that can actually enter the building.”

Hybrid wins the generative race; Templated sets the ceiling

The aggregate result is the centre of the paper.

| Method | SAT | DST | PCT | Success rate |
|---|---|---|---|---|
| Direct | 2.53 | 2.59 | 2.79 | 29.2% |
| LLM-only | 5.78 | 5.95 | 6.44 | 66.2% |
| Hybrid | 6.79 | 7.67 | 7.76 | 78.5% |
| Templated | 7.80 | 9.16 | 9.22 | 92.3% |

The first obvious reading is that Hybrid beats Direct and LLM-only. True, but incomplete.

The sharper reading is that reliability is doing most of the work. The paper reports that successful Direct outputs can still have respectable non-penalized quality. The problem is that too few of them succeed. Direct prompting is not mainly bad because every line of code is stupid. It is bad because the method too often produces unusable artifacts.
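A back-of-envelope check supports this reading. Assuming the aggregate scores are means over all attempts with failures counted as zero, dividing a score by the success rate approximates the mean among loadable outputs only. The implied values below are derived here from the table, not reported by the paper, and they inherit the rounding of the published figures.

```python
# Aggregate (penalized) SAT scores and success rates from the table above.
results = {
    "Direct": (2.53, 0.292),
    "LLM-only": (5.78, 0.662),
    "Hybrid": (6.79, 0.785),
    "Templated": (7.80, 0.923),
}


def implied_unpenalized(score: float, success_rate: float) -> float:
    """Mean score among successes, if every failure contributed exactly zero."""
    return score / success_rate


implied = {
    method: round(implied_unpenalized(score, rate), 2)
    for method, (score, rate) in results.items()
}

penalized_spread = max(s for s, _ in results.values()) - min(s for s, _ in results.values())
implied_spread = max(implied.values()) - min(implied.values())

for method, value in implied.items():
    print(f"{method:<10} implied SAT among successes: {value}")
print(f"penalized spread: {penalized_spread:.2f}, implied spread: {implied_spread:.2f}")
```

On these inputs the penalized SAT scores span more than five points, while the implied scores among successes sit within roughly a third of a point of each other. Under that assumption, most of the gap between methods comes from how often they fail, not from how good the survivors are.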

That distinction matters for business adoption. A leader looking at a demo might see one successful Direct output and conclude that the approach is nearly ready. The paper’s aggregate numbers say otherwise. A method that produces a usable DAG 29.2% of the time is not “almost there.” It is a roulette wheel with syntax highlighting.

The Templated method performs best overall, reaching 92.3% success with the strongest structural and conformance scores. That is unsurprising. Deterministic templates are excellent at repeating known patterns. They keep imports, operators, task groups, dependencies, mounts, and command formats under control.

But Templated generation also comes with the familiar cost: someone must design, maintain, and extend those templates. When requirements are stable, that cost is worth paying. When pipeline patterns vary frequently, pure templates become a different kind of bottleneck. The engineering burden moves from debugging generated code to curating the template universe.

Hybrid is the more interesting answer because it closes much of the gap to deterministic generation while preserving more adaptability. It does not pretend the LLM should own the whole artifact. It gives the model a fenced yard.

The boring middle layer is where the value lives

The popular mental model of LLM automation is still too output-centric. Ask question, receive artifact, inspect artifact. Prompt2DAG pushes attention upstream.

The key move is the intermediate representation: natural language becomes structured JSON; structured JSON becomes neutral YAML; YAML becomes executable DAG code. Each transformation narrows ambiguity. Each artifact can be checked before the next one is generated. Each stage gives humans and automated validators something to inspect.

For enterprise teams, that is the business value. The system is not merely “generating Airflow code.” It is creating a pipeline of accountable artifacts:

Business intent
Structured pipeline analysis
Platform-neutral workflow specification
Template-guided executable DAG
Automated quality and conformance checks
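The chain of artifacts above can be sketched as a sequence of transform-then-validate stages that stops at the first failed gate. Everything here is a toy: the stage functions are keyword-matching stand-ins for what would be LLM calls and template expansion, and names such as `run_pipeline` and `StageFailed` are hypothetical.

```python
from typing import Any, Callable

# (stage name, transform, validator returning a list of problems)
Stage = tuple[str, Callable[[Any], Any], Callable[[Any], list[str]]]


class StageFailed(Exception):
    """Raised when an artifact fails its gate; later stages never run."""


def run_pipeline(intent: Any, stages: list[Stage]) -> Any:
    artifact = intent
    for name, transform, validate in stages:
        artifact = transform(artifact)
        problems = validate(artifact)
        if problems:
            raise StageFailed(f"{name}: {'; '.join(problems)}")
    return artifact


# Stand-in for the structured analysis step (an LLM call in practice).
def analyze(text: str) -> dict:
    return {"components": [w for w in ("load", "geocode", "save") if w in text]}


def check_analysis(analysis: dict) -> list[str]:
    return [] if analysis["components"] else ["no components found"]


# Stand-in for the platform-neutral specification: a simple linear chain.
def to_spec(analysis: dict) -> dict:
    names = analysis["components"]
    return {"tasks": [{"id": n, "upstream": names[:i][-1:]} for i, n in enumerate(names)]}


def check_spec(spec: dict) -> list[str]:
    known = {t["id"] for t in spec["tasks"]}
    return [f"unknown upstream {u}" for t in spec["tasks"] for u in t["upstream"] if u not in known]


# Stand-in for code generation: emit only the dependency lines.
def render(spec: dict) -> str:
    return "\n".join(f"{u} >> {t['id']}" for t in spec["tasks"] for u in t["upstream"])


def check_code(code: str) -> list[str]:
    return [] if code else ["empty output"]


STAGES: list[Stage] = [
    ("analysis", analyze, check_analysis),
    ("specification", to_spec, check_spec),
    ("code", render, check_code),
]

dag_edges = run_pipeline("load then geocode then save", STAGES)
```

The point of the shape is that a vague request fails at the `analysis` gate with a named reason, instead of surfacing three stages later as an import error.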

This pattern should feel familiar. Mature software teams do not usually let production infrastructure emerge directly from prose. They use schemas, configuration files, compilation steps, tests, CI gates, and deployment policies. Prompt2DAG’s contribution is to show that LLM workflow generation benefits from the same old-fashioned discipline. Shocking development: software engineering still exists.

The Hybrid method’s advantage comes from placing model flexibility where it is useful and deterministic structure where failure is expensive. The LLM can help interpret vague requirements, map business language onto component types, and fill variable implementation details. Templates can enforce the shape of the Airflow artifact.

That division is the lesson. Autonomy improves when it is boxed intelligently.

The secondary tests explain why the headline result happens

The paper includes several analyses beyond the main performance table. They should not all be treated as equal evidence. Some are main evidence, some are diagnostic, and some are operational economics.

| Evidence item | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Aggregate SAT/DST/PCT and success rates | Main evidence | Hybrid is the best generative method; Templated is the reliability ceiling. | That the same ranking holds for every workflow type. |
| Cross-domain comparison of Hybrid and Templated | Robustness across the five case studies | Hybrid remains stable across sequential, parallel, task-parallel, procurement, and multilingual review cases. | Generality to streaming, ML training, or heavy conditional branching. |
| Model-specific performance table | Sensitivity to model choice | Model selection matters sharply; larger or stronger models generally fare better, but structure still matters. | That benchmark scores alone predict production success. |
| File-size analysis | Exploratory failure diagnosis | Failed artifacts tend to be smaller, suggesting incompleteness or truncation is a common failure mode. | That every failure is caused by truncation. |
| Step 1 semantic-fidelity assessment | Diagnostic analysis of comprehension | Missing information in early analysis correlates with later failure. | That an LLM judge perfectly measures semantic correctness. |
| Token usage and cost-per-success analysis | Operational cost assessment | Hybrid costs more per attempt but less per successful generative DAG than Direct. | Full total cost of ownership, including template maintenance and human review. |

The semantic-fidelity result is especially useful. Failed runs averaged more total analysis issues than successful ones, and missing information was the strongest negative correlate of success. In plain terms: omission hurts. If the analysis phase forgets a component, parameter, integration, or dependency, the final DAG may never recover.

That finding gives teams a practical control point. Improve the first structured analysis step, and downstream reliability should improve. For Hybrid and LLM-only methods, Step 1 is not documentation theatre. It is a leading indicator.
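The paper measures semantic fidelity with an LLM judge. A much cheaper structural proxy is to check the analysis artifact for obvious omissions before anything downstream runs. The sketch below is that proxy, not the paper's method: the section names and the `missing_information` helper are hypothetical, and it can only catch gaps that are visible in the artifact itself.

```python
# Sections this sketch assumes every enrichment-pipeline analysis must populate.
REQUIRED_SECTIONS = ("environment", "components", "data_flow")


def missing_information(analysis: dict) -> list[str]:
    """List omissions that are detectable without consulting the original prompt."""
    issues = [
        f"missing or empty section: {section}"
        for section in REQUIRED_SECTIONS
        if not analysis.get(section)
    ]
    defined = {component["name"] for component in analysis.get("components") or []}
    for source, target in analysis.get("data_flow") or []:
        for name in (source, target):
            if name not in defined:
                issues.append(f"data_flow references undefined component: {name}")
    return issues


# Illustrative artifact: the flow mentions a weather step nobody defined.
draft = {
    "environment": {"executor": "docker"},
    "components": [{"name": "load"}, {"name": "geocode"}],
    "data_flow": [["load", "geocode"], ["geocode", "weather"]],
}
issues = missing_information(draft)
```

A check like this will not notice a requirement the model silently dropped from every section, which is exactly the case a judge or a human reviewer is for. It does make the cheap omissions fail loudly and early.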

Direct prompting behaves differently. The paper notes a counterintuitive pattern: Direct failures can still show relatively high correct identification. That suggests the Direct method’s problem is not always comprehension. Sometimes the model understands enough and still fails to produce an executable multi-component artifact in one shot.

That is the difference between knowing the recipe and running the kitchen.

The cost story flips once failures are counted

At first glance, Hybrid looks expensive. It consumes more tokens per attempt than Direct or LLM-only because it spends extra budget in the DAG-generation phase.

The paper’s token figures, set against each method’s success rate, look like this:

| Method | Average tokens per attempt | Success rate | Approximate tokens per successful DAG |
|---|---|---|---|
| Direct | 17,221 | 29.2% | 58,975 |
| LLM-only | 17,572 | 66.2% | ~26,500 |
| Hybrid | 20,091 | 78.5% | 25,588 |
| Templated | 15,261 | 92.3% | ~16,500 |

The lesson is simple: per-attempt cost is the wrong denominator when failure rates are high.

Direct looks cheaper until you ask how many attempts are needed to obtain a usable DAG. Once failure is priced in, Hybrid becomes the most cost-effective generative method. It spends more to avoid wasting more. A familiar enterprise bargain, though usually with more procurement paperwork.
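The arithmetic behind that flip is short enough to write down. The sketch assumes attempts are independent, so the expected number of attempts per usable DAG is one over the success rate. Inputs come from the table above; because the rates are rounded, the results land within a few dozen tokens of the table's last column instead of matching it exactly.

```python
# (average tokens per attempt, success rate) from the table above.
rows = {
    "Direct": (17_221, 0.292),
    "LLM-only": (17_572, 0.662),
    "Hybrid": (20_091, 0.785),
    "Templated": (15_261, 0.923),
}


def tokens_per_success(tokens_per_attempt: float, success_rate: float) -> float:
    """Expected tokens spent per usable DAG when failed attempts are retried."""
    return tokens_per_attempt / success_rate


cost = {
    method: tokens_per_success(tokens, rate)
    for method, (tokens, rate) in rows.items()
}

# Templated is deterministic expansion, so compare only the generative methods.
generative = ("Direct", "LLM-only", "Hybrid")
cheapest_per_attempt = min(generative, key=lambda m: rows[m][0])
cheapest_per_success = min(generative, key=lambda m: cost[m])

for method, value in cost.items():
    print(f"{method:<10} ~{value:,.0f} tokens per successful DAG")
```

Ranked per attempt, Direct is the cheapest generative method; ranked per successful DAG, Hybrid is. The denominator changed, and so did the ranking.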

Templated generation still has the best token economics, but token economics are not total economics. Template authoring and maintenance are real costs, even if they do not show up in the token ledger. If a company has a small number of stable, repeated enrichment patterns, Templated generation may be the right end state. If the company faces frequent variation, Hybrid is more plausible as the operating default.

The business interpretation: build a governed workflow compiler

Here is what the paper directly shows: in this experimental setting, structured and template-guided generation produces more reliable Airflow DAGs than one-shot prompting, and Hybrid offers the best balance among generative methods.

Here is what Cognaptus would infer for business use: AI-assisted pipeline generation should be designed as a governed workflow compiler, not a chat-to-code toy. The user-facing experience can still begin with natural language. But the backend should force that language through typed schemas, platform-neutral workflow specifications, template scaffolds, and automated evaluation.

That has several practical consequences.

First, enterprises should catalogue reusable pipeline components before asking for broad natural-language automation. Prompt2DAG assumes pre-built containerized services: loaders, reconcilers, enrichers, exporters, splitters, and mergers. The method works because the system has building blocks. Without those blocks, the LLM is not generating a workflow; it is inventing infrastructure fan fiction.

Second, teams should treat the structured analysis artifact as a review surface. Business users can inspect whether the system captured the intended data sources, APIs, parameters, credentials, outputs, and parallelism. Data engineers can inspect whether the resulting workflow is plausible before executable code exists. This is where human-in-the-loop review becomes useful instead of ceremonial.

Third, evaluation should be wired into CI/CD. SAT, DST, and PCT are not perfect universal metrics, but the discipline is correct: generated workflows should be linted, structurally checked, loadability-gated, and dry-run tested. Non-loadable outputs should count as failures, not “almost successes.”

Fourth, template strategy should be tiered. Mature, high-volume, low-variance patterns should move toward deterministic templates. Variable but bounded workflows should use Hybrid. Direct prompting should remain where it belongs: prototypes, experiments, and demos that everyone promises not to deploy, right before someone deploys them.

Where the result applies—and where it does not

The paper’s boundaries are important because they prevent the wrong kind of excitement.

Prompt2DAG is tested on five data enrichment case studies. These are meaningful enterprise patterns, but they are not the whole data pipeline universe. The study does not prove the same performance on large-scale ELT warehouses, streaming pipelines with strict latency requirements, ML training workflows, lineage-heavy orchestration, or pipelines with extensive conditional branching and advanced retry logic.

The implementation target is also specific. The generated assets are Apache Airflow DAGs using Docker-based components in a standardized execution environment. If an organization runs Dagster, Prefect, Argo Workflows, Step Functions, or a heavily customized Airflow estate, the same principles may transfer, but the measured numbers should not be copy-pasted into a business case.

The cost analysis is useful but incomplete. Tokens are a clean comparison unit, not a full cost model. API pricing, local hosting, engineering labour, template maintenance, review time, incident cost, and organizational coding standards all affect deployment economics.

Finally, the semantic-fidelity assessment uses an LLM judge. That is scalable and structured, but not equivalent to a full expert audit. It is a useful diagnostic, not a papal decree.

These limitations do not weaken the main lesson. They sharpen it. The paper does not say “LLMs now automate data engineering.” It says that if you constrain the domain, standardize components, introduce intermediate representations, and validate aggressively, LLMs can become useful participants in workflow generation.

That is less glamorous than full autonomy. It is also much closer to something a real organization could run without immediately summoning the incident channel.

The management takeaway: autonomy needs scaffolding

Prompt2DAG is best read as a warning against hero prompts.

The Direct method represents the dream: describe the work, receive the workflow, move on with your life. The results make that dream look premature. Direct prompting succeeds often enough to impress in a meeting and fails often enough to be dangerous in production. A truly inconvenient combination.

The LLM-only modular method is a meaningful improvement because it decomposes the problem. But it still leaves too much structural responsibility to probabilistic code generation.

Hybrid is the practical answer: let the model reason through the variable parts, but let templates and validators carry the load-bearing structure. Templated generation is the reliability target for mature patterns, but Hybrid is the more adaptable bridge for organizations still discovering and standardizing their workflows.

For data leaders, the question is not whether natural-language pipeline generation is possible. It is what kind of system has to exist around the model before the output deserves trust.

The answer is not a bigger prompt. It is a pipeline for making pipelines.

Cognaptus: Automate the Present, Incubate the Future.


[1] Abubakari Alidu, Michele Ciavotta, and Flavio De Paoli, “Prompt2DAG: A Modular Methodology for LLM-Based Data Enrichment Pipeline Generation,” arXiv:2509.13487, https://arxiv.org/abs/2509.13487