Logs are where teams go after the dashboard has already failed.
A pipeline stalls. A model run produces nonsense. A compute job quietly burns budget on the wrong node. Someone opens three dashboards, two notebooks, and one ancient SQL snippet named final_debug_v3_really_final.sql. Then the archaeology begins.
The paper “LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology” proposes a more interesting answer: do not ask an LLM to “understand the workflow” in the abstract. Give it live provenance metadata, a compact schema, query guidelines, and tools that execute structured queries on its behalf.[1] In other words, stop treating the model as a psychic dashboard. Treat it as a controlled interface to workflow exhaust.
That distinction matters. The paper is not saying that LLMs can magically chat with terabytes of provenance data because they have read enough Stack Overflow and developed a soul. It is saying something narrower, more practical, and therefore much more useful: an agent can answer operational questions if the system hides raw scale behind the right abstractions.
The machine is not intelligence alone. It is provenance capture, streaming, schema management, prompt context, query generation, tool routing, database access, and auditability. The LLM is only one part of the mechanism. Naturally, it will still receive most of the attention, because we live in a society that believes the frosting baked the cake.
The actual innovation is architectural, not conversational
The easiest way to misunderstand the paper is to focus on the chat interface. A scientist asks a natural-language question. The agent returns a table, plot, or summary. Charming. Familiar. A bit suspicious.
But the paper’s real contribution sits behind that interface. The authors propose a modular architecture for an LLM-powered provenance agent that can interact with workflow metadata while the workflow is running. The architecture is designed for scientific workflows across the edge-cloud-HPC continuum, but the pattern generalises to enterprise data pipelines, AI training workflows, ETL systems, simulation stacks, and any environment where the answer to “what just happened?” is currently split across logs, queues, databases, and institutional folklore.
At a simplified level, the mechanism looks like this:
Instrumented workflow tasks
↓
Provenance messages: inputs, outputs, timings, telemetry, host, status
↓
Streaming hub
↓
Context Manager
↓
In-memory recent task buffer + Dynamic Dataflow Schema + Guidelines
↓
LLM generates structured query or tool intent
↓
Tool Router executes against buffer or provenance database
↓
Table, plot, summary, or anomaly tag
↓
The tool call and LLM interaction are themselves recorded as provenance
That last line is not decorative. If the agent’s analysis is also captured as provenance, the system can audit not only the workflow but the interpretation of the workflow. In regulated, scientific, or high-cost compute settings, this is the difference between “the agent said so” and “here is the trace of how the answer was produced.” The former is a chatbot. The latter is an analysis instrument with a paper trail.
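To make the loop concrete, here is a minimal sketch of the flow, not the paper's implementation. The class name `ToyProvenanceAgent`, the message fields, and the caller-supplied predicate (standing in for the LLM-generated query) are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyProvenanceAgent:
    # Recent task messages, bounded so memory does not grow with the run.
    buffer: deque = field(default_factory=lambda: deque(maxlen=1000))
    # Provenance about the agent's own actions.
    agent_provenance: list = field(default_factory=list)

    def ingest(self, message: dict) -> None:
        """Accept one provenance message, as if delivered by the streaming hub."""
        self.buffer.append(message)

    def run_query(self, question: str, predicate: Callable[[dict], bool]) -> list:
        """Filter the buffer, then record the call itself as provenance."""
        rows = [m for m in self.buffer if predicate(m)]
        self.agent_provenance.append(
            {"kind": "agent_tool_call", "question": question, "rows_returned": len(rows)}
        )
        return rows


# Hypothetical messages; field names are illustrative only.
agent = ToyProvenanceAgent()
agent.ingest({"task_id": "t1", "status": "FINISHED", "host": "node-a"})
agent.ingest({"task_id": "t2", "status": "FAILED", "host": "node-b"})
failed = agent.run_query("Which tasks failed?", lambda m: m["status"] == "FAILED")
```

After the call, `agent.agent_provenance` holds a record of the question and the number of rows returned, which is the toy version of the paper trail described above.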
The system uses the Model Context Protocol as a way to expose tools and resources to the agent. The paper’s architecture separates context management, prompt generation, LLM service calls, tool dispatch, and provenance data access. That separation is not mere software-engineering hygiene. It is what lets the agent remain useful without coupling every question to the full provenance database.
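One way to picture the tool-dispatch part of that separation is a small routing function. This is a sketch under stated assumptions: the `QueryIntent` type, the one-hour buffer window, and the rule that graph-style questions go to the persistent store are hypothetical choices, not a policy taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class QueryIntent:
    lookback_seconds: float        # how far back the question reaches
    needs_graph_traversal: bool = False


def route(intent: QueryIntent, buffer_window_seconds: float = 3600.0) -> str:
    """Pick a data source name for a query intent (hypothetical policy)."""
    if intent.needs_graph_traversal:
        # Multi-hop questions are assumed to need the persistent store.
        return "provenance_database"
    if intent.lookback_seconds <= buffer_window_seconds:
        # Recent questions can be served from the in-memory buffer.
        return "in_memory_buffer"
    return "provenance_database"
```

The point of keeping this decision outside the model is that it can be tested and changed without touching the prompt.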
The model sees the map, not the warehouse
The paper’s key design choice is almost boring, which is often where the valuable engineering lives: the LLM does not ingest the raw provenance database.
Instead, the system maintains a Dynamic Dataflow Schema. This schema is inferred incrementally from incoming provenance messages. It records workflow activities, available input and output fields, inferred types, example values, and common task metadata such as workflow IDs, campaign IDs, activity names, timestamps, hostnames, telemetry, and status fields.
That compact schema becomes the agent’s working map.
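A rough sketch of incremental schema inference follows. The message layout (`activity`, `used`, `generated`) and the cap of two example values per field are assumptions for illustration; the paper's actual schema format may differ.

```python
def update_schema(schema: dict, message: dict, max_examples: int = 2) -> dict:
    """Fold one provenance message into a per-activity field summary."""
    activity = schema.setdefault(message["activity"], {})
    for section in ("used", "generated"):  # hypothetical input/output sections
        for name, value in message.get(section, {}).items():
            entry = activity.setdefault(
                f"{section}.{name}",
                {"type": type(value).__name__, "examples": []},
            )
            # Keep only a few distinct example values per field.
            if value not in entry["examples"] and len(entry["examples"]) < max_examples:
                entry["examples"].append(value)
    return schema


# One thousand task records collapse into one activity with two fields.
schema = {}
for i in range(1000):
    update_schema(
        schema,
        {
            "activity": "train",
            "used": {"lr": 0.01 * (i % 3 + 1)},
            "generated": {"loss": 1.0 / (i + 1)},
        },
    )
```

The summary stays the same size whether the loop runs a thousand times or a million, which is the property the design relies on.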
This design solves three practical problems at once.
First, it keeps prompts within context limits. Large workflow runs can generate gigabytes or terabytes of provenance. Even generous context windows do not make “paste the database into the model” a sane architecture. Context windows grow; production data grows faster. One of them also costs money.
Second, it limits exposure of raw records. Many organisations would prefer not to ship detailed operational data, sensitive parameters, or proprietary outputs to an external LLM service. A metadata-first approach does not eliminate privacy concerns, but it reduces the blast radius.
Third, it makes accuracy depend more on workflow structure than on raw volume. The agent’s LLM interaction scales with the number and diversity of activities and fields, not with every individual task record. This is the central business-relevant trick: the system turns a data-volume problem into a schema-and-semantics problem.
That trade-off is attractive, but not free. If field names are meaningless, instrumentation is sloppy, or domain semantics are hidden inside undocumented code, the schema will faithfully summarise confusion. The model cannot infer business intent from x2_final_tmp unless the surrounding system gives it something better to work with. Despite rumours from LinkedIn, abbreviations are not ontology.
Guidelines are the underrated control surface
The paper’s evaluation shows that context helps, but not all context helps equally.
The authors test prompt and RAG configurations that build up from a minimal baseline toward a full setup containing role instructions, task descriptions, output formatting, few-shot examples, the dynamic dataflow schema, domain values, and query guidelines. The important result is not merely that “more context improves performance.” That would be unsurprising and not especially useful.
The important result is that query guidelines deliver a large accuracy gain at relatively low token cost.
In the reported ablation, average scores rise from 0.06 in the baseline to 0.97 in the full configuration. The largest jump occurs between a configuration using few-shot examples plus schema and one using few-shot examples plus guidelines: from 0.56 to 0.92. Token usage also rises sharply across richer configurations, from 293 tokens to more than 4,300. Schema descriptions and domain values are useful but expensive; guidelines and examples are comparatively efficient.
That is a very practical finding. Many enterprise AI deployments respond to errors by adding more context. More documentation. More fields. More screenshots. More “please be careful” incantations. The paper suggests a better first move: encode query conventions.
For example:
| Agent problem | Weak response | Stronger control surface |
|---|---|---|
| User asks about time ranges | Hope the model chooses the right timestamp | Guideline: use started_at for start-time filtering |
| User uses domain shorthand | Hope the model maps jargon to fields | Guideline: map “learning rate” to lr |
| User asks for “latest” | Hope the model sorts correctly | Guideline: sort by timestamp, not by task ID |
| User asks for a plot | Hope chart logic matches analytic intent | Guideline: specify grouping and aggregation rules |
This matters because many production failures are not caused by the model lacking facts. They are caused by the model choosing the wrong convention. Guidelines are where those conventions become executable.
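As a small illustration, the sketch below renders guidelines into a prompt and shows why the “latest” convention matters. The prompt wording, the `build_prompt` helper, and the task records are hypothetical; only the guideline texts echo the table above.

```python
GUIDELINES = [
    "Use started_at for start-time filtering.",
    'Map "learning rate" to the field lr.',
    "For 'latest', sort by timestamp, not by task ID.",
]


def build_prompt(question: str, schema_summary: str, guidelines: list) -> str:
    """Assemble a prompt with guidelines as an explicit, reusable section."""
    rules = "\n".join(f"- {g}" for g in guidelines)
    return (
        "You translate questions into DataFrame queries.\n"
        f"Schema:\n{schema_summary}\n"
        f"Query guidelines:\n{rules}\n"
        f"Question: {question}\n"
    )


prompt = build_prompt("Show the latest task", "task_id: str, started_at: float", GUIDELINES)

# Why the third guideline exists: task IDs are strings and need not be ordered in time.
tasks = [
    {"task_id": "t-9", "started_at": 100.0},
    {"task_id": "t-10", "started_at": 250.0},
    {"task_id": "t-2", "started_at": 300.0},
]
latest_by_id = max(tasks, key=lambda t: t["task_id"])       # picks "t-9"
latest_by_time = max(tasks, key=lambda t: t["started_at"])  # picks "t-2"
```

Both answers are syntactically valid queries. Only one matches what the user meant, and a one-line guideline is what separates them.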
The evaluation is a development method, not a leaderboard
The paper compares several LLMs: LLaMA 3 variants, GPT-4, Gemini 2.5 Flash Lite, and Claude Opus 4. It also uses GPT and Claude as judges to score generated queries against human-written gold-standard DataFrame code. The query set contains 20 natural-language questions, evenly split between OLTP-style lookups and OLAP-style exploratory analysis, covering control flow, dataflow, scheduling, and telemetry.
It would be easy to turn this into a model-ranking story. That would be the less useful reading.
The authors explicitly use the evaluation methodology as an iterative design process. The query taxonomy exposes where the agent fails. The prompt/RAG configurations test which context components matter. Multiple judges expose scoring bias. Repeated runs and median scores reduce variation. The experiment is less “Which model wins?” and more “Which parts of the agent architecture actually improve query correctness?”
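The aggregation behind that method can be sketched in a few lines: repeat each evaluation, take the median per query class and per judge, and compare. The scores and judge names below are invented purely to show the mechanics; they are not results from the paper.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (query_class, judge, score) rows from repeated runs.
rows = [
    ("OLTP", "judge_a", 1.0), ("OLTP", "judge_a", 0.9), ("OLTP", "judge_a", 1.0),
    ("OLTP", "judge_b", 0.8), ("OLTP", "judge_b", 0.9), ("OLTP", "judge_b", 0.7),
    ("OLAP", "judge_a", 0.5), ("OLAP", "judge_a", 0.7), ("OLAP", "judge_a", 0.6),
    ("OLAP", "judge_b", 0.4), ("OLAP", "judge_b", 0.6), ("OLAP", "judge_b", 0.5),
]


def summarise(score_rows: list) -> dict:
    """Median score per (query class, judge), which damps run-to-run noise."""
    grouped = defaultdict(list)
    for query_class, judge, score in score_rows:
        grouped[(query_class, judge)].append(score)
    return {key: median(values) for key, values in grouped.items()}


summary = summarise(rows)
```

Keeping the judges in separate columns, rather than averaging them away, is what lets disagreement between judges show up at all.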
A useful evidence map looks like this:
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Synthetic workflow with 20 curated queries | Main evidence and development testbed | The agent can generate useful structured queries across query classes when context is designed well | Broad production reliability across messy enterprise systems |
| Prompt/RAG configuration comparison | Ablation | Guidelines, few-shot examples, schema, and domain values have different accuracy/token trade-offs | That the same exact prompt stack is optimal everywhere |
| GPT and Claude as judges | Robustness and evaluation sensitivity check | Judge rankings are broadly consistent, but absolute scores differ and self-preference appears | That LLM judging removes the need for human oversight |
| Model comparison across query classes | Diagnostic comparison | GPT and Claude perform strongly; smaller or more variable models struggle in specific ways | A universal model leaderboard for provenance agents |
| Chemistry workflow on Frontier | Real-world demonstration and exploratory extension | The architecture transfers from synthetic math workflow to a more complex scientific workflow | Full generalisation to all scientific or enterprise workflows |
| Response-time measurement | Implementation viability check | Responses remain interactive in the tested setup, on the order of seconds | Latency under extreme-scale, multi-user production load |
The strongest experimental message is consistent: top-tier models perform well under full context, OLTP queries are easier, OLAP questions are harder, and graph-like dataflow or control-flow reasoning remains a weak spot. Scheduling and telemetry look easier than nested dataflow interpretation. That pattern is exactly what one would expect if the model is better at filtering rows than reasoning through causal workflow structure. Which is to say: useful, not magical.
The judge results are also instructive. GPT as judge rated GPT and Claude almost identically, while Claude as judge slightly favoured Claude over GPT. The authors treat this appropriately: not as scandal, but as evidence that multi-judge evaluation is healthier than trusting one model to bless the outputs of another. The priesthood should at least have two priests.
The chemistry demo shows both transfer and fragility
The second use case is a computational chemistry workflow running on Frontier, focused on density functional theory calculations and bond dissociation energy analysis. The workflow takes a SMILES string, finds conformations, fragments molecules, runs calculations, submits HPC jobs, and generates thermodynamic properties and bond dissociation energies. Compared with the synthetic workflow, it has nested structures and domain-specific semantics.
This is where the architecture earns credibility. The agent, initially prototyped on a synthetic mathematical workflow, transfers to a more realistic chemistry workflow without bespoke domain prompt engineering. It answers live natural-language queries through a web interface, returning tables, plots, and summaries.
The examples are revealing because they are not all clean victories.
The agent correctly identifies the bond with the highest dissociation free energy and infers the relevant unit. It correctly reports the DFT functional used, though with a table that repeats values more than necessary. It correctly retrieves multiplicity and charge for the parent molecule and fragments, even enriching the answer with chemical terminology. It also generates a correct bar plot for bond dissociation enthalpy by bond label.
Then come the useful failures.
Asked for the lowest bond enthalpy, the agent identifies the right value but uses the wrong unit and omits the expected bond ID. Asked for the number of atoms in the parent molecule, it incorrectly sums atom counts across molecules and returns 81 instead of the parent’s 9 atoms. Asked to plot bond dissociation enthalpy with averaged C-H values, it fails to group and average before plotting, even though it later succeeds when asked directly for the average C-H bond dissociation enthalpy.
These failures are not random. They point to the hard boundary between schema-aware querying and domain-aware interpretation.
A schema can tell the agent that fields exist. Guidelines can tell it which timestamp or variable to prefer. Few-shot examples can teach query form. But when a user says “parent molecule,” “average C-H values,” or “lowest energy bond enthalpy,” the system must align natural-language intent with domain semantics, units, aggregation logic, and display expectations.
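Two of the failures described above come down to a missing step before the answer is produced. The sketch below shows both steps on invented data: the bond values are made up and carry no unit, the fragment atom counts are hypothetical, and only the parent's 9 atoms comes from the reported example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical bond records; "bde" values are illustrative, not real chemistry.
bonds = [
    {"bond_id": "C-H_1", "bond_type": "C-H", "bde": 100.0},
    {"bond_id": "C-H_2", "bond_type": "C-H", "bde": 102.0},
    {"bond_id": "C-C_1", "bond_type": "C-C", "bde": 88.0},
]


def average_by_type(rows: list) -> dict:
    """Group by bond type and average, the step to run before plotting."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["bond_type"]].append(row["bde"])
    return {bond_type: mean(values) for bond_type, values in grouped.items()}


plot_data = average_by_type(bonds)  # one bar per bond type, not per bond

# "Atoms in the parent molecule" means filter first, not sum across everything.
molecules = [
    {"role": "parent", "n_atoms": 9},
    {"role": "fragment", "n_atoms": 8},   # hypothetical fragment
    {"role": "fragment", "n_atoms": 1},   # hypothetical fragment
]
summed_over_all = sum(m["n_atoms"] for m in molecules)  # the mistaken reading
parent_count = next(m["n_atoms"] for m in molecules if m["role"] == "parent")
```

Neither fix is hard to write. The hard part is knowing that “parent” implies a filter and “averaged” implies a group-by, which is exactly the domain knowledge a schema alone does not carry.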
That is where enterprise readers should pay attention. In business workflows, the equivalent phrases are “active customer,” “net revenue,” “failed retry,” “latest approved version,” “same supplier,” and “material incident.” These are not just database fields. They are semantic contracts. Without those contracts, a provenance agent will sometimes produce beautifully formatted wrongness.
The business value is conversational lineage, not dashboard replacement
The practical business lesson is not “replace dashboards with chat.” Please do not make that slide. Someone will believe it.
The better interpretation is: add a governed conversational layer over lineage and observability metadata, especially where predefined dashboards cannot anticipate every diagnostic question.
Dashboards are excellent for known questions. Scripts are excellent for repeatable analysis. SQL is excellent when the analyst knows the schema and has time to explore it. The provenance agent is useful in the space between them: live exploratory diagnosis, audit queries, workflow monitoring, and questions that cross task history, input-output relationships, scheduling, and telemetry.
Here is the clean separation:
| Category | What the paper directly shows | Cognaptus business interpretation | What remains uncertain |
|---|---|---|---|
| Live workflow interaction | An agent can query recent runtime provenance using schema, guidelines, and structured query generation | Data and AI teams can reduce time spent manually digging through logs and ad hoc notebooks | Performance under messy, multi-team production environments |
| Metadata-driven scaling | LLM interaction depends on schema and guidelines, not raw provenance volume | Keep LLM costs and context exposure bounded by using metadata as the interface layer | Whether schema complexity grows gracefully in very heterogeneous stacks |
| Prompt/RAG design | Guidelines and few-shot examples deliver strong gains; schema and values help but cost more tokens | Operational conventions should be encoded as reusable guidelines, not buried in team memory | How to maintain guidelines automatically as workflows evolve |
| Model behaviour | GPT and Claude perform strongly; OLAP and graph-like reasoning remain harder | Use adaptive routing and fallback tools for complex analytic questions | Which routing policy works best in production |
| Chemistry workflow transfer | The same architecture works on a more complex real workflow with mostly correct or partially correct answers | The pattern can extend beyond toy pipelines when instrumentation is meaningful | Domain semantics, units, and aggregation rules still need governance |
For enterprises, the closest analogues are data lineage platforms, ML observability systems, incident diagnosis tools, experiment tracking systems, and audit infrastructure. A provenance-aware agent could answer questions such as:
- Which upstream task generated the bad batch?
- Which model training run used the deprecated feature table?
- Which node executed the failed GPU jobs?
- Which pipeline step changed output distribution after the last deployment?
- Which agent tool call produced the recommendation shown to the user?
- Which workflow used a parameter outside the approved range?
The interesting part is not that these questions are impossible today. They are possible, usually through a combination of SQL, dashboards, log search, notebooks, and senior engineers who have not taken a real holiday since 2019. The value is reducing the friction between question and traceable answer.
The agent should generate queries, not just prose
One subtle but important design choice is output format. The paper discusses result sets, structured queries, and summaries. The lightweight strategy is to have the LLM generate a query, then let the system execute it. This sounds less glamorous than asking the LLM to “answer directly,” but it is much safer.
A generated query can be inspected. It can be executed deterministically. It can fail visibly. It can be corrected. It can be logged as part of provenance. It can also be evaluated against a gold-standard query.
A prose answer, by contrast, may hide the mistake inside fluency. That is convenient only if the goal is to lose the audit trail while sounding confident. A popular enterprise pastime, admittedly, but not a good one.
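A minimal sketch of what “fail visibly” can look like is below. The paper's agent generates DataFrame code; the dictionary-shaped query here is a simplified stand-in, and `execute`, `ALLOWED_OPS`, and the field names are assumptions for illustration.

```python
import operator

# Only these comparison operators are accepted from a generated query.
ALLOWED_OPS = {"==": operator.eq, ">": operator.gt, "<": operator.lt}


def execute(query: dict, records: list, known_fields: set) -> list:
    """Validate a structured filter query, then run it deterministically."""
    if query["field"] not in known_fields:
        raise ValueError(f"unknown field: {query['field']}")
    if query["op"] not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {query['op']}")
    compare = ALLOWED_OPS[query["op"]]
    return [r for r in records if compare(r[query["field"]], query["value"])]


# Hypothetical task records.
records = [
    {"task_id": "t1", "memory_gb": 12.0},
    {"task_id": "t2", "memory_gb": 48.0},
]
fields = {"task_id", "memory_gb"}
heavy = execute({"field": "memory_gb", "op": ">", "value": 32.0}, records, fields)
```

A query against a field that does not exist raises an error instead of producing a plausible paragraph, which is the behaviour one wants from an analysis instrument.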
The paper’s interface even displays generated code and runtime errors, allowing users to correct queries and convert recurring fixes into new guidelines. This is not yet seamless, but it is the right direction: user feedback becomes operational policy, not just a disappointed sigh into the chat box.
In a business setting, that suggests a practical design rule. Do not deploy provenance agents as answer machines. Deploy them as query-and-analysis machines with visible intermediate artefacts.
The system should show:
- the interpreted intent;
- the selected tool or data source;
- the generated query;
- the result;
- any assumptions or guidelines applied;
- the provenance record of the agent’s own action.
That may feel less magical. Good. Magic is hard to debug.
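The list above maps naturally onto a single record per agent answer. The `AgentTrace` type, its field names, and the sample values are hypothetical; this is one possible shape, not a format defined by the paper.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AgentTrace:
    interpreted_intent: str          # what the agent thinks was asked
    data_source: str                 # selected tool or data source
    generated_query: str             # the query text shown to the user
    result_preview: list             # first rows of the result
    guidelines_applied: list = field(default_factory=list)
    provenance_id: str = ""          # link to the agent's own provenance record


trace = AgentTrace(
    interpreted_intent="list failed tasks from the current run",
    data_source="in_memory_buffer",
    generated_query="df[df['status'] == 'FAILED'][['task_id', 'hostname']]",
    result_preview=[{"task_id": "t2", "hostname": "node-b"}],
    guidelines_applied=["Use started_at for start-time filtering."],
    provenance_id="agent-call-0001",
)

# Serialised form that could be displayed in the UI or stored alongside task provenance.
serialised = json.dumps(asdict(trace), indent=2)
```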
What to build before buying a bigger context window
The paper is also a useful antidote to a common procurement reflex: when the model struggles, buy a larger context window.
Sometimes that helps. Often it delays the architectural conversation.
A serious pilot inspired by this work would start with the data plumbing, not the model subscription tier:
| Layer | Practical design question |
|---|---|
| Provenance capture | Which workflow events, inputs, outputs, telemetry, hosts, statuses, and timestamps must be captured consistently? |
| Streaming | Will the system use Redis, Kafka, or another broker, and what latency/reliability trade-off is acceptable? |
| Storage | Which queries belong in an in-memory buffer, and which require persistent storage or graph traversal? |
| Dynamic schema | How will fields, types, examples, and workflow activities be summarised for the agent? |
| Guidelines | Which team conventions must become explicit query rules? |
| Tool routing | Which questions should hit recent context, historical databases, graph stores, dashboards, or anomaly detectors? |
| Audit | Are LLM calls, tool invocations, generated queries, and results themselves recorded as provenance? |
| Evaluation | What golden query set reflects the actual diagnostic questions users ask? |
The golden query set is especially important. The paper uses 20 curated questions across workload and data types. For an enterprise pilot, the equivalent should not be invented by the AI team alone. It should come from incident reviews, audit requests, recurring dashboard gaps, analyst notebooks, and the questions operators ask in Slack when the pipeline is on fire.
If the evaluation set contains only easy lookups, the pilot will look excellent and then disappoint everyone exactly when the system is needed. Include boring OLTP questions. Include ugly OLAP questions. Include ambiguous business terms. Include graph-like causality questions. Include unit and aggregation traps. The system should fail during evaluation, not during a board-reporting deadline.
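A coverage check over the golden set is cheap to automate. The tag names and the `coverage_gaps` helper below are assumptions for illustration; the questions are drawn from examples elsewhere in this article.

```python
def coverage_gaps(golden_set: list, required_tags: set) -> list:
    """Return the required question categories that no golden query covers."""
    present = set()
    for item in golden_set:
        present.update(item["tags"])
    return sorted(required_tags - present)


# Hypothetical category tags for an enterprise pilot.
REQUIRED_TAGS = {
    "oltp", "olap", "ambiguous_term",
    "graph_causality", "unit_trap", "aggregation_trap",
}

golden_set = [
    {"question": "Which node executed the failed GPU jobs?", "tags": ["oltp"]},
    {
        "question": "Which pipeline step changed output distribution after the last deployment?",
        "tags": ["olap", "graph_causality"],
    },
    {
        "question": "What was net revenue per active customer last week?",
        "tags": ["olap", "ambiguous_term"],
    },
]

gaps = coverage_gaps(golden_set, REQUIRED_TAGS)
```

Here `gaps` lists the unit and aggregation traps as uncovered, which is a useful thing to learn before the pilot rather than after it.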
The boundary is not scale alone; it is semantics
The paper’s limitations are refreshingly specific.
The experiments focus mainly on online queries over recent or active workflow runs using an in-memory context. The architecture supports offline querying, but deep graph traversal over persistent provenance databases remains future work. That matters because many of the most valuable enterprise questions are causal and multi-hop: “Which upstream change caused this downstream drift?” is not the same as “Show tasks with high memory.”
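To see why multi-hop questions are a different kind of query, compare a row filter with the traversal below. The task names and the `parents` mapping are hypothetical, and a plain breadth-first search stands in for whatever a real graph store would provide.

```python
from collections import deque


def upstream(task: str, parents: dict) -> set:
    """Collect every task reachable by following 'derived from' edges backwards."""
    seen = set()
    queue = deque([task])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical lineage: each task maps to the tasks whose outputs it consumed.
parents = {
    "report": ["aggregate"],
    "aggregate": ["clean_a", "clean_b"],
    "clean_a": ["ingest"],
    "clean_b": ["ingest"],
}

suspects = upstream("report", parents)
```

“Show tasks with high memory” touches one table once. “What fed this report?” needs the whole chain, and the answer set grows with the depth of the workflow rather than with a filter condition.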
The agent’s performance also depends on meaningful instrumentation and code semantics. If workflows emit vague field names, inconsistent units, or poorly structured events, the dynamic schema will not rescue them. Provenance quality is still quality. The agent makes metadata easier to query; it does not absolve teams from designing metadata worth querying.
There is also the evaluation boundary. LLM-as-a-judge is scalable and useful, but it is not an oracle. The paper’s use of GPT and Claude as separate judges improves reliability, yet the judges show different absolute scoring patterns and mild self-preference. Human supervision remains necessary, especially where wrong answers carry operational or scientific cost.
Finally, the paper does not benchmark extreme-scale production workloads. The authors argue plausibly that metadata-driven prompting keeps LLM interaction independent of raw task volume, but the in-memory buffer, storage backend, query engine, concurrency model, and graph traversal layer still matter in real deployments. Scaling the prompt is not the same as scaling the system.
The sharper reading
The best way to read this paper is not as another “LLMs for dashboards” story. It is a pattern for turning operational exhaust into governed, queryable intelligence.
The core lesson is simple:
The agent is useful because the system refuses to let the LLM improvise over raw chaos.
It gives the model a compact schema. It gives it examples. It gives it guidelines. It asks for executable queries. It routes those queries through tools. It records the agent’s own actions. It evaluates failures by query class. It treats prompt design as part of system design, not as a mystical craft practised by whoever writes the most apologetic instructions.
For businesses building AI and data infrastructure, that is the transferable insight. Real-time intelligence will not come from chat pasted on top of logs. It will come from provenance architectures where every workflow action is traceable, every diagnostic question has a route to structured data, and every agent answer can be inspected.
The dashboard is not dead. The notebook is not dead. SQL is definitely not dead, despite many premature obituaries from people who have never maintained a data warehouse. But the interface layer is changing.
The next useful observability system will not only show what happened. It will let a human ask why, generate the query, show the evidence, and leave a trace of its own reasoning path.
Not prompts alone. Provenance first.
Cognaptus: Automate the Present, Incubate the Future.
[1] Renan Souza, Timothy Poteet, Brian Etz, Daniel Rosendo, Amal Gueroudji, Woong Shin, Prasanna Balaprakash, and Rafael Ferreira da Silva, “LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology,” arXiv:2509.13978, 2025, https://arxiv.org/abs/2509.13978.