TL;DR

Most teams still analyze pipelines with brittle SQL, custom scripts, and static dashboards. A new reference architecture shows how schema-driven LLM agents can read workflow provenance in real time—across edge, cloud, and HPC—answering “what/when/who/how” questions, plotting quick diagnostics, and flagging anomalies. The surprising finding: guideline-driven prompting (not just bigger context) is the single highest‑ROI upgrade.


Why this matters (for operators, data leads, and CTOs)

When production AI/data workflows sprawl across services (queues, training jobs, GPUs, file systems), the real telemetry isn’t in your app logs; it’s in the provenance—the metadata of tasks, inputs/outputs, scheduling, and resource usage. Turning that exhaust into live answers is how you:

  • Shorten mean time to explain (MTTE) for incidents and drifts.
  • Reduce post‑mortem archaeology after failed runs.
  • Run compliance and reproducibility checks on demand, not quarterly.

This work proposes a lightweight agent that sits on top of provenance streams and a dynamic schema, so the LLM doesn’t need to ingest your entire database to be useful.


What they actually built

  • A reference architecture for a provenance‑aware AI agent that interprets natural‑language questions and returns structured queries, tables, or plots—without coupling the LLM to your raw data volume.
  • A dynamic dataflow schema that the agent maintains on the fly from incoming provenance messages, capturing activities, inputs/outputs, and example values.
  • A modular toolchain using a streaming hub, a provenance keeper/database, an LLM server, an anomaly detector, and an MCP‑based tool router.
  • An evaluation methodology spanning four query classes (what/when/who/how) and both OLTP‑style lookups and OLAP‑style explorations, tested on a synthetic math workflow and a real density functional theory (DFT) chemistry workflow.

The core idea in one picture

A live provenance stream feeds a Context Manager that keeps two compact structures in memory:

  1. In‑memory context (recent tasks)
  2. Dynamic dataflow schema (fields + sample values + inferred types)

When you ask, “Which tasks spiked GPU memory during conformer search on Frontier?”, the agent uses guidelines + the schema to generate query code (not a giant data paste), executes it against the in‑memory buffer or the provenance DB, and returns a small table or plot.
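To make the Context Manager concrete, here is a minimal Python sketch, under assumed message fields and class names (nothing here is the paper's actual API): it keeps a small buffer of recent tasks and grows the dynamic dataflow schema from each incoming provenance message.

```python
from collections import deque


class ContextManager:
    """Keeps the two compact structures the agent shares with the LLM:
    a buffer of recent tasks and a dynamic dataflow schema."""

    def __init__(self, buffer_size: int = 500):
        self.recent_tasks = deque(maxlen=buffer_size)   # in-memory context
        self.schema: dict[str, dict[str, dict]] = {}    # activity -> field -> info

    def ingest(self, msg: dict) -> None:
        """Update both structures from one provenance message."""
        self.recent_tasks.append(msg)
        activity = msg.get("activity", "unknown")
        fields = self.schema.setdefault(activity, {})
        for section in ("used", "generated", "telemetry", "scheduling"):
            for key, value in msg.get(section, {}).items():
                info = fields.setdefault(
                    f"{section}.{key}",
                    {"type": type(value).__name__, "samples": []},
                )
                if len(info["samples"]) < 3:   # keep a few example values only
                    info["samples"].append(value)

    def schema_summary(self) -> str:
        """Compact text the agent can paste into a prompt instead of raw records."""
        lines = []
        for activity, fields in self.schema.items():
            lines.append(f"activity: {activity}")
            for name, info in fields.items():
                lines.append(f"  {name} ({info['type']}) e.g. {info['samples']}")
        return "\n".join(lines)
```

The point of the design is that the schema summary stays small no matter how many tasks flow through, which is what keeps the LLM inside its context window.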


Provenance query types (business‑friendly map)

| Axis | Options | Typical questions / uses |
|---|---|---|
| Workload | OLTP lookups | “Show runs for model X after 12:00.” |
| Workload | OLAP exploration | “Group by node; find tasks with longest wait time.” |
| Data type | Control flow | “Which step triggered downstream retries?” |
| Data type | Dataflow | “Where did feature drift first appear?” |
| Data type | Scheduling | “Which cluster/node executed failing steps?” |
| Data type | Telemetry | “Plot GPU mem vs. step for last 20 runs.” |
| Mode | Online | Live monitoring and alerts |
| Mode | Offline | Post-hoc analysis for audits and reproducibility |
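For a feel of the difference between the two workload styles, here is an illustrative sketch of the kind of code the agent might generate, assuming the provenance buffer is exposed as a pandas DataFrame with hypothetical column names (task_id, model, node, start_time, wait_time_s).

```python
import pandas as pd

# df: provenance buffer flattened into a DataFrame (hypothetical columns:
# task_id, model, node, start_time, wait_time_s)

def oltp_lookup(df: pd.DataFrame) -> pd.DataFrame:
    """'Show runs for model X after 12:00.' — a point lookup."""
    cutoff = pd.Timestamp("2025-01-01 12:00")       # illustrative cutoff
    return df[(df["model"] == "X") & (df["start_time"] > cutoff)]

def olap_exploration(df: pd.DataFrame) -> pd.DataFrame:
    """'Group by node; find tasks with longest wait time.' — an aggregation."""
    return (df.sort_values("wait_time_s", ascending=False)
              .groupby("node", as_index=False)
              .first()[["node", "task_id", "wait_time_s"]])
```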

What the experiments show (practical takeaways)

  • Guidelines beat raw context. Adding crisp, domain‑specific query guidelines delivered the biggest accuracy jump; tossing in more schema/examples helps, but with higher token cost.
  • OLTP is easy; OLAP is hard. Simple lookups score near‑perfect. Exploratory, graph‑ish questions (control/dataflow) remain error‑prone, so plan for fallback tools and a human in the loop.
  • Model parity at the top. Frontier models (GPT, Claude) perform similarly at near‑perfect levels with the full setup; smaller open models vary more. Use adaptive routing by query class (a minimal sketch follows this list).
  • Interactive latency is fine. Even with full prompts, responses stayed around the “feels live” bar for dashboards.
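What routing by query class could look like in practice, as a minimal sketch; the model names and keyword heuristic below are placeholder assumptions, not anything prescribed by the paper:

```python
# Route each question to a model tier based on a rough query-class guess.
# Model names and keywords are illustrative placeholders.
ROUTES = {
    "oltp": "small-open-model",   # cheap, near-perfect on simple lookups
    "olap": "frontier-model",     # exploratory / graph-ish questions
    "plot": "frontier-model",
}

def classify(question: str) -> str:
    q = question.lower()
    if any(w in q for w in ("plot", "chart", "trend")):
        return "plot"
    if any(w in q for w in ("group by", "compare", "longest", "drift", "why")):
        return "olap"
    return "oltp"

def route(question: str) -> str:
    return ROUTES[classify(question)]

# route("Group by node; find tasks with longest wait time.") -> "frontier-model"
```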

Implementation sketch (what you’d actually deploy)

  • Capture: decorators & observability adapters emit standard provenance messages (task IDs, used/generated payloads, telemetry, timings); see the capture sketch after this list.
  • Stream: Redis/Kafka/Mofka as the streaming hub.
  • Persist: a keeper normalizes messages into a PROV‑style schema in Mongo/Neo4j/LMDB.
  • Agent: MCP server with a Tool Router, Context Manager, Anomaly Detector, and Guidelines store.
  • UI/API: chat UI + HTTP query endpoint; optional Grafana overlay.
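Here is a hedged sketch of the capture step: a decorator that publishes one provenance message per task call to a Redis stream acting as the hub. The message fields, stream name, and task function are illustrative assumptions, not the reference architecture's exact format.

```python
import functools
import json
import time
import uuid

import redis  # pip install redis

r = redis.Redis()  # assumes a local Redis serving as the streaming hub

def provenance(activity: str):
    """Wrap a task function and publish a provenance message per call."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            task_id = str(uuid.uuid4())
            start = time.time()
            result = fn(*args, **kwargs)
            msg = {
                "task_id": task_id,
                "activity": activity,
                "used": {"args": repr(args), "kwargs": repr(kwargs)},
                "generated": {"result": repr(result)},
                "telemetry": {"duration_s": time.time() - start},
            }
            # XADD to the stream the keeper consumes and normalizes downstream
            r.xadd("provenance", {"payload": json.dumps(msg)})
            return result
        return inner
    return wrap

@provenance("conformer_search")
def conformer_search(molecule: str) -> int:
    return len(molecule)  # placeholder for the real task body
```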

Critical design choice: send schema + queries to the LLM, not bulk records. You stay inside context limits and avoid shipping sensitive data.
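A sketch of that design choice in code, reusing the hypothetical ContextManager from the earlier sketch; the guideline text and prompt wording are illustrative assumptions:

```python
GUIDELINES = """\
- Generate pandas code over a DataFrame named df built from the provenance buffer.
- Use only the fields listed in the schema; never invent columns.
- Return a small table or a matplotlib figure, never raw records."""

def build_prompt(question: str, ctx) -> str:
    """Send guidelines + compact schema to the LLM; bulk records stay local."""
    return (
        "You write provenance queries.\n"
        f"Guidelines:\n{GUIDELINES}\n\n"
        f"Dataflow schema (field names, types, sample values):\n{ctx.schema_summary()}\n\n"
        f"Question: {question}\n"
        "Reply with Python code only."
    )

# The returned code is then executed locally against the buffer or the
# provenance DB, so no sensitive records ever leave the environment.
```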


Where this beats your dashboard

  • Natural questions → structured queries → immediate plots.
  • Cross‑cutting joins (flow + telemetry + scheduling) without pre‑built panels.
  • Explainability trail: every tool call and LLM exchange becomes its own provenance, so you can audit the analysis itself (see the sketch after this list).
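One way to get that trail, assuming the hypothetical @provenance decorator from the capture sketch above, is to wrap the agent's own tool dispatch so its analysis steps land in the same stream as workflow tasks; the registry and tool names here are placeholders.

```python
# Assumes the @provenance decorator defined in the capture sketch above.
TOOL_REGISTRY = {
    "query_buffer": lambda code: f"executed: {code}",   # placeholder tool
}

@provenance("agent.tool_call")
def run_tool(tool_name: str, arguments: dict):
    """Dispatch a tool call; its inputs and outputs become provenance too."""
    return TOOL_REGISTRY[tool_name](**arguments)

# run_tool("query_buffer", {"code": "df.head()"}) emits a provenance message
# describing the agent's own analysis step, making the analysis auditable.
```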

Limitations to respect

  • Complex multi‑hop graph analysis still benefits from a graph DB + bespoke code.
  • Accuracy hinges on good field names and semi‑sane instrumentation.
  • OLAP‑heavy analytics still need “classic” data processing alongside the agent.

A 90‑day playbook to pilot this in your stack

  • Days 1–15: Instrument two pipelines (one synthetic, one real) with minimal field names and consistent units.
  • Days 16–30: Stand up the streaming hub, keeper, and Mongo/Neo4j; enable an MCP agent with guideline injection.
  • Days 31–60: Build a golden query set of 20–30 questions (mix OLTP/OLAP); iterate prompts and guidelines weekly (a scoring sketch follows this list).
  • Days 61–90: Add an anomaly detector for telemetry spikes; wire up alerting; trial LLM routing by query class.
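For the golden query set, a lightweight harness like the sketch below is usually enough to track guided vs. unguided accuracy week over week; the entry format and the run_agent callable are assumptions for illustration, not a prescribed interface.

```python
# Each entry pairs a natural-language question with the expected answer shape.
GOLDEN = [
    {"q": "Show runs for model X after 12:00.",
     "kind": "oltp", "expect_columns": ["task_id", "model", "start_time"]},
    {"q": "Group by node; find tasks with longest wait time.",
     "kind": "olap", "expect_columns": ["node", "task_id", "wait_time_s"]},
]

def score(run_agent, with_guidelines: bool) -> float:
    """run_agent(question, with_guidelines=...) returns a DataFrame-like
    answer (hypothetical); score = fraction of cases with the expected columns."""
    hits = 0
    for case in GOLDEN:
        answer = run_agent(case["q"], with_guidelines=with_guidelines)
        if answer is not None and set(case["expect_columns"]) <= set(answer.columns):
            hits += 1
    return hits / len(GOLDEN)

# Track both numbers weekly:
# print({"guided": score(agent, True), "unguided": score(agent, False)})
```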

Key success metrics: time‑to‑first‑answer (TTFA), guided vs unguided accuracy, % of incidents diagnosed via agent queries, and number of dashboard panels you retire.


What this means for Cognaptus clients

For clients with multi‑stage AI/ETL and GPU training, this agent pattern is a fast path to observability with brains. It cuts through glue‑code debt, makes audits explainable, and keeps context windows small enough for production.

Bottom line: provenance agents are the missing middle layer between raw telemetry and decision‑grade answers.


Cognaptus: Automate the Present, Incubate the Future