Opening — Why this matters now
Public transport operators are drowning in telemetry. Fuel logs, route patterns, driver behavior metrics—every dataset promises “efficiency,” but most decision-makers receive only scatterplots and silence. As AI sweeps through industry, the bottleneck is no longer data generation but data interpretation. The paper we examine today argues that multimodal LLMs—when arranged in a disciplined multi‑agent architecture—can convert analytical clutter into credible, consistent, human-ready narratives. Not hype. Not dashboards. Actual decisions.
Background — Context and prior art
Classic analytics teams follow a familiar ritual: raw-data charts, clustering figures, post hoc explanations, and then someone stays late rewriting everything into a stakeholder report. Prior LLM-based systems focused on single tasks—captioning a chart, summarizing a table, or describing an image. But real industrial pipelines are not a single chart; they are chains of heterogeneous artifacts that need continuity and judgment.
The literature to date—LLM4Vis, DataNarrative, DS-Agent, etc.—mostly stops at chart-level summarization or sandboxed storytelling. Useful, but insufficient for domains like energy informatics, where narrative precision (and the absence of hallucinated numbers) is non-negotiable.
Analysis — What the paper does
The paper proposes a three-agent multimodal LLM framework that automates scientific narration across an entire analytics pipeline:
- Data Narration Agent – Interprets each multimodal artifact (charts, tables, distributions) using global and domain-specific knowledge.
- LLM-as-a-Judge Agent – Scores the narrative for clarity, relevance, insightfulness, and context using a deterministic scoring rubric (see the sketch after this list).
- Human-in-the-Loop (optional) – Intervenes only when the model’s interpretation enters “dangerously confident” territory.
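To make the judge and escalation steps concrete, here is a minimal sketch of how a deterministic rubric gate might work. The four criteria come from the paper; the 1–5 scale, the weights, and the escalation threshold are illustrative assumptions, not the authors' values.

```python
from dataclasses import dataclass

# Criteria named in the paper's rubric; weights and threshold are assumptions.
CRITERIA = ("clarity", "relevance", "insightfulness", "context")
WEIGHTS = {"clarity": 0.25, "relevance": 0.30, "insightfulness": 0.25, "context": 0.20}
ESCALATION_THRESHOLD = 3.5  # below this weighted score, route to a human reviewer

@dataclass
class JudgeScores:
    """Per-criterion scores (1-5) returned by the LLM-as-a-Judge agent."""
    clarity: float
    relevance: float
    insightfulness: float
    context: float

def weighted_score(scores: JudgeScores) -> float:
    return sum(WEIGHTS[c] * getattr(scores, c) for c in CRITERIA)

def needs_human_review(scores: JudgeScores) -> bool:
    """Deterministic gate: low rubric scores trigger the human-in-the-loop step."""
    return weighted_score(scores) < ESCALATION_THRESHOLD

# Example: a fluent but thinly grounded narrative gets flagged for review.
example = JudgeScores(clarity=5, relevance=4, insightfulness=2, context=2)
print(weighted_score(example), needs_human_review(example))  # 3.35 True
```

The point of keeping the gate deterministic is auditability: the same scores always produce the same decision, so a flagged narrative can be traced back to the exact criteria that failed.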
This architecture processes analysis in four stacked stages:
- Raw data description
- Modeling results (e.g., GMM clustering)
- Post hoc analytics (driver entropy, route patterns, route-type variability; a toy sketch of these computations follows this list)
- Final integrated report generation
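To ground the modeling and post hoc stages, here is a toy sketch (not the authors' code) that fits a Gaussian mixture to per-trip fuel features and scores each driver by the Shannon entropy of their cluster assignments. The feature names, the three-component mixture, and the entropy-over-clusters reading of "driver entropy" are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture
from scipy.stats import entropy

# Toy trip-level telemetry; the real feature set is an assumption.
rng = np.random.default_rng(0)
trips = pd.DataFrame({
    "driver_id": rng.integers(0, 5, size=400),
    "fuel_per_km": rng.gamma(shape=2.0, scale=0.2, size=400),
    "avg_speed": rng.normal(35, 8, size=400),
})

# Modeling stage: GMM clustering of trips into consumption regimes.
gmm = GaussianMixture(n_components=3, random_state=0)
trips["cluster"] = gmm.fit_predict(trips[["fuel_per_km", "avg_speed"]])

# Post hoc stage: driver entropy = how evenly a driver's trips spread across clusters.
def driver_entropy(cluster_labels: pd.Series) -> float:
    probs = cluster_labels.value_counts(normalize=True).to_numpy()
    return float(entropy(probs, base=2))

per_driver = trips.groupby("driver_id")["cluster"].apply(driver_entropy)
print(per_driver.sort_values(ascending=False))
```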
Each stage passes both context and narrative into the next, creating continuity rather than isolated explanations.
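A minimal orchestration sketch of that continuity, assuming each stage is a function that takes the accumulated context and returns a narrative fragment; the stage names mirror the paper, but `narrate_artifact`, the artifact placeholders, and the plain-string context are hypothetical simplifications.

```python
# Hypothetical orchestration loop: in practice each call would invoke the Data
# Narration Agent (a multimodal LLM) on the stage's charts/tables plus prior context.
def narrate_artifact(stage: str, artifact: dict, context: str) -> str:
    return f"[{stage}] narrative grounded in {sorted(artifact)} (prior context: {len(context)} chars)"

STAGES = [
    ("raw_data_description", {"summary_stats": ..., "distribution_plots": ...}),
    ("modeling_results", {"gmm_clusters": ...}),
    ("post_hoc_analytics", {"driver_entropy": ..., "route_type_variability": ...}),
    ("final_report", {}),
]

def run_pipeline(stages=STAGES) -> str:
    context = ""
    for stage, artifact in stages:
        narrative = narrate_artifact(stage, artifact, context)
        context += "\n" + narrative   # each stage sees everything produced before it
    return context                    # the final report is built on the full chain

print(run_pipeline())
```

The design choice worth copying is the accumulation: later stages never narrate in isolation, which is what keeps the final report consistent with the earlier blocks.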
Findings — Results with visualization
The authors benchmark 15 combinations of prompting strategies and LLMs. Chain-of-Thought (CoT) paired with GPT‑4.1‑mini emerges as the sweet spot: high accuracy, reasonable cost, and minimal hallucination.
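For readers who want to see what a CoT narration call can look like, below is a minimal sketch using the OpenAI Python SDK. The prompt wording, the base64 image handling, and the `gpt-4.1-mini` model string are assumptions for illustration; the paper's actual prompts are not reproduced here.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def narrate_chart_cot(image_path: str, stage_context: str) -> str:
    """Ask the model to reason step by step before writing the narrative."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    prompt = (
        "You are narrating one artifact in a fuel-analytics pipeline.\n"
        f"Context from earlier stages:\n{stage_context}\n\n"
        "Think step by step: (1) describe what the chart shows, "
        "(2) tie it to the context, (3) state only conclusions the data supports. "
        "Then write a short narrative paragraph."
    )
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```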
Below is a summary table of the key model–prompting tradeoffs:
| Model + Prompt | Accuracy | Informativeness | Execution Time | Cost | Notes |
|---|---|---|---|---|---|
| GPT‑4.1 mini + CoT | 97.3% | High | Moderate | Low | Best overall balance |
| GPT‑4.1 mini + CCoT | 98.5% | Moderate | Very Slow | Higher | Slightly better accuracy, not worth overhead |
| o4‑mini + DN | 94.4% | High | Moderate | High | Most verbose; more errors |
| Claude 3.5 Haiku variants | 84–91% | Medium | Fast | Medium | Good speed, lower precision |
| Gemini Flash | 90–93% | Medium | Fast | Low | Competitive but less consistent |
In the final report stage, GPT‑4.1 mini again wins, producing 100% data-supported recommendations at a fraction of the cost of full-sized models.
An interesting metric introduced by the authors is Accurate Narrative Information Density (ANID)—correct information points per 100 words. A polite reminder that verbosity is not intelligence.
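Taking that definition at face value, ANID is just correct information points scaled to a 100-word basis; the sketch below assumes the correct points have already been counted by the judge or a human reviewer, and the paper's exact normalization may differ.

```python
def anid(narrative: str, correct_points: int) -> float:
    """Accurate Narrative Information Density: correct points per 100 words."""
    words = len(narrative.split())
    return 100.0 * correct_points / words if words else 0.0

# A 250-word narrative with 12 verified information points scores 4.8.
print(anid("word " * 250, correct_points=12))
```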
Implications — Where this matters for business and industry
For industries where analytics output multiplies faster than analyst headcount, this architecture matters because it:
- Reduces interpretation bottlenecks — charts become narratives without manual stitching.
- Enforces quality control — the judge-agent prevents “eloquent nonsense,” a frequent LLM failure mode.
- Supports compliance and auditability — narrative provenance is traceable back through each block.
- Lowers the cost of decision-making — especially when lightweight models outperform their heavyweight siblings.
In public transport fuel management, the framework synthesized 4,006 trips' worth of telemetry, GMM clusters, driver entropy, and route-type variability into actionable recommendations, not just descriptions. Translate this to manufacturing, logistics, energy, or insurance, and the ROI picture becomes obvious.
Conclusion
The paper delivers a clear message: the future of industrial analytics is not a single omnipotent model, but disciplined multi-agent cooperation—each agent specializing, checking, and contextualizing. For organizations trying to turn floods of multimodal data into decisions, this architecture is not futuristic. It is simply overdue.
Cognaptus: Automate the Present, Incubate the Future.