Ghostwriters in the Machine: How Multi‑Agent LLMs Turn Raw Transport Data Into Decisions

A bus operator does not usually suffer from a shortage of charts. It suffers from a more irritating problem: charts that explain themselves only to the person who made them.

The fuel-efficiency analyst has a histogram. The data scientist has a clustering plot. The operations manager has a timetable to defend, a fuel bill to reduce, and perhaps a driver-training programme to justify. Somewhere between those roles, insight quietly evaporates into a PDF appendix.

The paper behind today’s article tackles that translation gap rather than pretending that another dashboard will save us. Zhipeng Ma and colleagues propose a multi-agent multimodal LLM framework for turning fuel-efficiency analytics in public transport into structured, stakeholder-ready reports.¹ The important word here is turning. This is not an autonomous bus optimisation agent. It does not decide the modelling method, reroute vehicles, discipline drivers, or magically reduce diesel consumption by reading a ternary plot and feeling inspired. It interprets analytical artefacts already produced by a data-science workflow, checks the resulting narratives, and synthesises them into a report that non-technical stakeholders can actually use.

That sounds modest. It is also where much of the business value lives.

The machine is not driving the bus; it is writing the missing middle layer

The framework is best understood as a reporting pipeline wrapped around analytics, not as an analytics engine wearing a chauffeur’s hat.

The authors design a modular “block” with three roles. First, a data narration agent reads multimodal inputs such as charts, tables, and numerical outputs, then produces a textual explanation. Second, an LLM-as-a-judge agent evaluates that explanation for clarity, relevance, insightfulness, and contextualisation. Third, an optional human-in-the-loop evaluator intervenes when the output needs domain correction or expert validation.
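
One block of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: `Verdict`, `run_block`, the stub agents, and the 1-to-5 scoring scale are all hypothetical names and choices; only the three roles and the four judging criteria come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# The four criteria the judge agent scores, as listed in the article.
CRITERIA = ("clarity", "relevance", "insightfulness", "contextualisation")


@dataclass
class Verdict:
    scores: dict        # criterion -> score (the 1-5 scale is an assumption)
    needs_human: bool   # judge flags output it cannot validate on its own


def run_block(
    artefact: str,
    background: str,
    narrate: Callable[[str, str], str],
    judge: Callable[[str, str], Verdict],
    ask_human: Optional[Callable[[str], str]] = None,
) -> str:
    """Narrate one artefact, judge the result, escalate only if flagged."""
    narrative = narrate(artefact, background)
    verdict = judge(artefact, narrative)
    if verdict.needs_human and ask_human is not None:
        narrative = ask_human(narrative)
    return narrative


# Stub agents standing in for real LLM calls (hypothetical behaviour).
def stub_narrate(artefact: str, background: str) -> str:
    return f"Narrative of {artefact}"


def stub_judge(artefact: str, narrative: str) -> Verdict:
    return Verdict({c: 5 for c in CRITERIA}, needs_human="ambiguous" in artefact)


def stub_human(narrative: str) -> str:
    return narrative + " [expert-corrected]"


clean = run_block("histogram.png", "bus fuel data", stub_narrate, stub_judge, stub_human)
flagged = run_block("ambiguous_plot.png", "bus fuel data", stub_narrate, stub_judge, stub_human)
```

The human evaluator is an optional argument rather than a mandatory step, which mirrors the "optional" wording: the block still runs when no reviewer is wired in.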

This block is then used repeatedly across four workflow stages:

Stage	What enters the system	What the agents produce	Business meaning	Boundary
Raw data description	Data-distribution artefacts such as histograms	Plain-language description of the dataset and its shape	Managers learn what kind of operational data they are dealing with	The system describes prepared artefacts; it does not audit the raw dataset itself
Data modelling	Model outputs, here GMM clustering visualisations	Explanation of clusters and fuel-efficiency patterns	Technical segmentation becomes readable operational grouping	The modelling choice remains external to the LLM system
Post hoc analytics	Driver, route, and route-type distribution charts	Interpretation of where patterns appear and how concentrated they are	Supports targeted interventions such as route review or driver coaching	Post hoc explanations depend on the artefacts supplied by analysts
Integration and narration	Prior narratives plus the original background	A stakeholder-ready report with recommendations	Converts fragmented analysis into a coherent decision document	The final report inherits earlier errors unless checks catch them

The clever part is not that an LLM can describe a chart. By now, that is less a revelation than a procurement line item. The more interesting move is architectural: the system separates cumulative background context from generated narrative outputs. Earlier descriptions enrich later stages, while narrative products are also carried forward into the final reporting agent. This gives the report generator more context than a one-shot “please summarise these figures” prompt, without requiring every chart to be interpreted in isolation.
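
The separation of background from narratives is easier to see in code. The sketch below assumes a hypothetical `run_workflow` function and stub agents; how the paper actually stores and formats context is not described here, so the string concatenation is a placeholder.

```python
from typing import Callable, List, Tuple


def run_workflow(
    initial_background: str,
    stages: List[Tuple[str, List[str]]],
    narrate: Callable[[str, str], str],
    synthesise: Callable[[str, List[str]], str],
) -> str:
    """Run stages in order, keeping background and narratives as separate stores."""
    background = initial_background   # cumulative context, grows after each stage
    narratives: List[str] = []        # generated outputs, carried to the report

    for stage_name, artefacts in stages:
        stage_texts = [narrate(a, background) for a in artefacts]
        narratives.extend(stage_texts)
        # Earlier descriptions enrich the context seen by later stages.
        background += f"\n[{stage_name}] " + " ".join(stage_texts)

    # The reporting agent receives both the background and every narrative.
    return synthesise(background, narratives)


context_sizes = []  # lines of background visible at each narration call


def stub_narrate(artefact: str, background: str) -> str:
    context_sizes.append(background.count("\n") + 1)
    return f"Narrative of {artefact}."


def stub_synthesise(background: str, narratives: List[str]) -> str:
    return f"Report built from {len(narratives)} narratives"


stages = [
    ("raw data description", ["histogram"]),
    ("data modelling", ["density plot", "scatter plot"]),
    ("post hoc analytics", ["driver chart", "route chart"]),
]
report = run_workflow("Bus fuel-efficiency study", stages, stub_narrate, stub_synthesise)
```

In this toy run the post hoc charts are narrated with three lines of context rather than one, which is the whole point of carrying background forward.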

There is also a practical anti-hallucination choice hidden in the workflow. The authors avoid feeding extensive raw datasets directly into the LLM because doing so can encourage fabricated or unreliable content. Instead, the system receives visual and analytical artefacts, such as distributions and model-result plots. That design choice matters. In business analytics, LLM reliability often depends less on theatrical model intelligence and more on whether the inputs are shaped into something the model can interpret without inventing a small novel.
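
A rough illustration of the same idea in text form: reduce trip-level records to a compact distribution summary before anything reaches a prompt. The paper passes charts rather than bin tables, so this is only an analogue. The data is synthetic, and `histogram_artefact`, the bin width, and the payload format are assumptions.

```python
import random
from collections import Counter


def histogram_artefact(values, bin_width):
    """Bin values and return a compact, ordered summary for a narration prompt."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    return [
        {"from": lo, "to": lo + bin_width, "trips": counts[lo]}
        for lo in sorted(counts)
    ]


# Synthetic stand-in for trip-level fuel efficiency in litres per 100 km.
rng = random.Random(0)
trips = [rng.gauss(40.0, 6.0) for _ in range(4006)]

artefact = histogram_artefact(trips, bin_width=5)

# The model would see a handful of summary lines, not thousands of rows.
prompt_payload = "\n".join(
    f"{b['from']}-{b['to']} L/100 km: {b['trips']} trips" for b in artefact
)
```

The payload shrinks from 4006 values to a few bins, leaving the model far less room to "find" patterns in rows it cannot reliably read.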

The case study is public-bus fuel efficiency, not generic dashboard poetry

The validation case uses fuel-efficiency data from public bus transportation in Northern Jutland, Denmark. The dataset contains 4006 recorded trips, with fuel efficiency measured in litres per 100 km. The underlying analytics use Gaussian Mixture Model clustering to segment trips by fuel-consumption patterns.

The analysis produces eleven scientific charts across the workflow. Stage 1 uses a histogram of fuel-efficiency distribution. Stage 2 uses a probability density estimate and scatter plot to visualise GMM clustering. Stage 3 adds post hoc analytics: driver distributions across clusters, route distributions across clusters, and route-type composition across urban, rural, and highway categories.

This is a good testbed because transport fuel efficiency has a familiar operational logic. Bad fuel performance is rarely just “the bus used too much fuel”. It can be tied to routes, traffic conditions, route type, vehicle assignment, driver behaviour, or combinations that are annoying enough to keep consultants gainfully employed. Clustering helps reveal groups of trips with similar consumption profiles. Post hoc analysis then asks whether those groups align with specific drivers, routes, or route types.
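
For readers who have not met Gaussian Mixture Models, here is a minimal one-dimensional version fitted with plain expectation-maximisation. It is a teaching sketch on synthetic data: the two components, the cluster centres of 30 and 50 L/100 km, and the function names are all assumptions, and nothing here reflects the paper's actual features, component count, or tooling.

```python
import math
import random


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fit_gmm_1d(xs, k, iterations=50):
    """Fit a k-component 1-D Gaussian mixture with plain EM."""
    lo, hi = min(xs), max(xs)
    # Spread the initial means evenly across the data range.
    means = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]
    sds = [(hi - lo) / (2 * k)] * k
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: responsibility of each component for each trip.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and standard deviations.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sds[j] = max(math.sqrt(var), 1e-3)
    return weights, means, sds


def assign(x, weights, means, sds):
    """Hard-assign a trip to the component with the highest weighted density."""
    dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    return dens.index(max(dens))


# Synthetic trips: an efficient group and a thirsty group (made-up centres).
rng = random.Random(1)
xs = [rng.gauss(30.0, 2.0) for _ in range(200)] + [rng.gauss(50.0, 2.0) for _ in range(200)]

weights, means, sds = fit_gmm_1d(xs, k=2)
```

Once each trip has a cluster label, the post hoc step is a matter of counting labels per driver, route, or route type.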

The paper’s framework sits after that analytical work. It does not discover fuel-efficiency theory from first principles. It translates the outputs into a report that could support decisions such as driver-focused interventions, route-specific strategies, route-type monitoring, and continuous fuel-efficiency analysis.

That distinction is not pedantic. It is the difference between “AI can replace the transport analyst” and “AI can reduce the analyst’s reporting burden while making the work more consistently legible”. The first claim is more exciting. The second is more useful.

The benchmark asks which small model can narrate without becoming expensive furniture

The authors run three sets of experiments, each with a different purpose.

Experiment	Likely purpose	What it supports	What it does not prove
Prompt and model comparison for chart narration	Main evidence for choosing the narration configuration	Identifies which model-prompt pairing best balances accuracy, informativeness, cost, and speed	Does not prove the same choice is optimal in all domains
Report-generation model comparison	Main evidence for final synthesis	Shows which model produces data-supported recommendations at acceptable cost	Does not prove recommendations are causally valid interventions
Ablation study	Component contribution test	Shows how background propagation, evaluation, and selective human input affect narrative quality	Does not prove full automation is safe without oversight

The narration benchmark compares five lightweight models: GPT-4.1 mini, o4-mini, Claude 3.5 Haiku, Gemini 2.5 Flash, and Llama 4 Maverick. Each is tested with three prompting strategies: Chain-of-Thought (CoT), Contrastive Chain-of-Thought (CCoT), and DataNarrative prompting. The evaluation metrics include execution time, cost, narrative length, informativeness, accuracy, and Accurate Narrative Information Density, or ANID.

The headline result is not simply “bigger model wins”. GPT-4.1 mini with Chain-of-Thought prompting is selected as the best cost-performance configuration for narration. It achieves 97.31% accuracy, costs 0.24 cents per narration, takes 13.27 seconds, and extracts 20.91 information points on average.

There is a nuance worth keeping. GPT-4.1 mini with Contrastive Chain-of-Thought reaches higher accuracy at 98.52%, but it takes 41.48 seconds and costs 0.62 cents. In other words, the marginal accuracy gain comes with a large increase in execution time and cost. DataNarrative prompting extracts more information points with GPT-4.1 mini, 25.55 versus 20.91, but accuracy falls to 95.45%. More information is not automatically better information. Enterprise reporting teams may wish to print that on a mug.

The selected configuration is therefore not the absolute winner on every metric. It is the sensible compromise: high accuracy, decent informativeness, lower cost, and manageable runtime.

Configuration	Accuracy	Cost	Time	Interpretation
GPT-4.1 mini + CoT	97.31%	0.24 cents	13.27 s	Best balance for narration
GPT-4.1 mini + CCoT	98.52%	0.62 cents	41.48 s	Slightly more accurate, much heavier
GPT-4.1 mini + DataNarrative	95.45%	0.23 cents	10.91 s	More information points, lower accuracy
o4-mini + CoT	96.97%	0.96 cents	14.53 s	Competitive accuracy, higher cost
Gemini 2.5 Flash + DataNarrative	90.60%	0.07 cents	5.14 s	Cheap and fast, weaker accuracy
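
The trade-off between the two strongest GPT-4.1 mini rows is plain arithmetic on the figures in the table above. The only assumption in this sketch is the last step, which supposes one narration per chart across the case study's eleven charts.

```python
# Figures copied from the comparison table (GPT-4.1 mini rows).
cot = {"accuracy": 97.31, "cost_cents": 0.24, "seconds": 13.27}
ccot = {"accuracy": 98.52, "cost_cents": 0.62, "seconds": 41.48}

accuracy_gain = round(ccot["accuracy"] - cot["accuracy"], 2)    # percentage points
extra_cost = round(ccot["cost_cents"] - cot["cost_cents"], 2)   # cents per narration
extra_time = round(ccot["seconds"] - cot["seconds"], 2)         # seconds per narration

cost_ratio = ccot["cost_cents"] / cot["cost_cents"]
time_ratio = ccot["seconds"] / cot["seconds"]

# Assumption: one narration per chart, eleven charts in the case study.
charts = 11
batch_extra_seconds = round(extra_time * charts, 2)

print(f"+{accuracy_gain} points for +{extra_cost} cents and +{extra_time} s per narration")
```

Roughly 1.2 points of accuracy costs about 2.6 times the money and about 3.1 times the wait, per narration.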

For business use, the lesson is blunt: model selection should be tied to reporting risk. If the output informs low-stakes internal exploration, cheap and fast may be enough. If it supports operational recommendations, a few percentage points of accuracy matter. If it supports regulatory reporting, procurement committees may need to stop asking only about token cost and start asking about verification design. A radical idea, I know.
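
One way to make "tie model selection to reporting risk" concrete is to set an accuracy floor per reporting tier and pick the cheapest configuration that clears it. The rows come from the comparison table; the tier names and floor values are hypothetical and are not taken from the paper.

```python
# Rows from the narration comparison table: (configuration, accuracy %, cents, seconds).
CONFIGS = [
    ("GPT-4.1 mini + CoT", 97.31, 0.24, 13.27),
    ("GPT-4.1 mini + CCoT", 98.52, 0.62, 41.48),
    ("GPT-4.1 mini + DataNarrative", 95.45, 0.23, 10.91),
    ("o4-mini + CoT", 96.97, 0.96, 14.53),
    ("Gemini 2.5 Flash + DataNarrative", 90.60, 0.07, 5.14),
]

# Hypothetical accuracy floors per reporting tier; illustrative values only.
ACCURACY_FLOOR = {"exploration": 90.0, "operational": 97.0, "regulatory": 98.0}


def cheapest_meeting_floor(tier):
    """Return the lowest-cost configuration whose accuracy meets the tier's floor."""
    floor = ACCURACY_FLOOR[tier]
    eligible = [c for c in CONFIGS if c[1] >= floor]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c[2])[0]


choices = {tier: cheapest_meeting_floor(tier) for tier in ACCURACY_FLOOR}
```

A floor only chooses the model. For the regulatory tier, the verification design around that model still has to be argued separately.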

The final report test rewards grounded recommendations, not pretty readability scores

The second experiment evaluates the model used for final report generation. Here the candidates are GPT-4.1, GPT-4.1 mini, and o4-mini. The paper assesses readability, cost, report length, and whether recommendations are explicitly supported by the underlying chart narratives.

GPT-4.1 and GPT-4.1 mini both generate five recommendations, with all five supported by the data narratives. GPT-4.1 mini does so at 0.47 cents, while GPT-4.1 costs 2.52 cents. o4-mini produces the most readable report by the paper’s readability measures, but only three of its four recommendations are data-supported, a 75% support rate, and it costs 1.52 cents.
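
The same figures can be put on a common footing. In the sketch below, "cents per supported recommendation" is a derived metric introduced for illustration, not one the paper reports, and o4-mini's total of four recommendations is inferred from three supported at a 75% support rate.

```python
# (model, recommendations made, recommendations supported, cost in cents)
# o4-mini's total of four is inferred from 3 supported at a 75% support rate.
REPORTS = [
    ("GPT-4.1", 5, 5, 2.52),
    ("GPT-4.1 mini", 5, 5, 0.47),
    ("o4-mini", 4, 3, 1.52),
]


def summarise(made, supported, cost_cents):
    return {
        "support_rate": supported / made,
        "cents_per_supported": cost_cents / supported,
    }


summary = {name: summarise(made, sup, cost) for name, made, sup, cost in REPORTS}

# Prefer the highest support rate first, then the lowest cost per supported item.
best = min(
    summary,
    key=lambda n: (-summary[n]["support_rate"], summary[n]["cents_per_supported"]),
)
```

Ranking on support rate before cost encodes the hierarchy the authors apply: grounding first, price second, readability not at all.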

That result is quietly important. In stakeholder communication, readability matters. But unsupported readability is just confidence with better line spacing. The authors prioritise recommendation support over surface readability, which is the right hierarchy for decision support. A beautiful report that smuggles in weakly grounded advice is not a report. It is a brochure with a spreadsheet accent.

The system therefore selects GPT-4.1 mini again, this time for final synthesis. It matches GPT-4.1 on supported recommendations while being much cheaper. For organisations trying to industrialise analytics reporting, that kind of result is more relevant than leaderboard glamour. The winning model is not the one with the grandest name; it is the one that produces enough grounded usefulness per unit cost.

The ablation study says validation is useful, but more validation is not always more truth

The ablation study tests four configurations: Chain-of-Thought alone, Chain-of-Thought with background context, Chain-of-Thought with background plus an LLM-as-a-judge, and the full baseline with background, LLM-as-a-judge, and optional human intervention.

This is not a side quest. It addresses the central design question: do the agents actually add value, or are they just a diagrammatic way to make a prompt look like enterprise architecture?

The full baseline performs best overall, especially on accuracy, while producing the shortest narratives. The paper reports that 10 of the 11 narratives in the baseline require no human intervention, and the one case that does involve human intervention improves the narrative. That is the desirable form of human-in-the-loop design: not ceremonial approval at every step, but selective escalation when machine narration hits something it cannot reliably infer.

The more interesting finding is that forced extra validation can backfire. The CoT+B+E configuration (Chain-of-Thought, background, and LLM-as-a-judge evaluation) explicitly invokes the judge by forcing one revision. That can add detail, but some of the additional detail turns out to be irrelevant or inaccurate, which reduces factual precision. The obvious corporate instinct is to add more review layers until no one remembers what was being reviewed. The paper suggests a more disciplined lesson: validation should be triggered by need, not by superstition.
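
The difference between the two policies fits in one conditional. This sketch uses hypothetical names throughout (`narrate_with_policy`, the stub scorer and reviser, the 0-to-1 score scale, the 0.7 threshold, and the placeholder draft sentence); it shows the control flow only, not the paper's judge.

```python
from typing import Callable


def narrate_with_policy(
    draft: str,
    score: Callable[[str], float],
    revise: Callable[[str], str],
    threshold: float,
    force_revision: bool,
) -> str:
    """Revise when the judge's score is low, or unconditionally if forced."""
    if force_revision or score(draft) < threshold:
        return revise(draft)
    return draft


# Stub judge and reviser standing in for LLM calls (hypothetical behaviour).
def stub_score(text: str) -> float:
    return 0.9  # the draft is already good


def stub_revise(text: str) -> str:
    return text + " Additionally, here is some loosely related detail."


draft = "Placeholder narrative about one cluster."

need_based = narrate_with_policy(draft, stub_score, stub_revise, 0.7, force_revision=False)
forced = narrate_with_policy(draft, stub_score, stub_revise, 0.7, force_revision=True)
```

Under the need-based policy a good draft passes through untouched; under the forced policy it gets longer whether or not it needed to.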

In reporting systems, verbosity often masquerades as rigour. Here, the best configuration is concise, accurate, and selectively human-assisted. That is a useful design principle.

The business value is standardised interpretation, not automated optimisation

What does this mean for organisations outside Northern Jutland bus operations?

The direct result is narrow: the paper demonstrates a framework on one public-transport fuel-efficiency case using 4006 trips, GMM clustering, and eleven chart inputs. It shows that a multi-agent LLM workflow can generate accurate chart narratives and stakeholder-oriented reports under that setup.

The broader business inference is more interesting but must be kept in its lane. Many organisations already have analytics pipelines that produce useful but fragmented artefacts: dashboards, cluster plots, SHAP charts, residual diagnostics, exception reports, and tables that only the modelling team enjoys. The bottleneck is often not the existence of analytics. It is the conversion of analytics into decision-grade communication.

A framework like this could sit between analytics production and managerial consumption. It could generate draft reports for fleet managers, energy teams, manufacturing supervisors, maintenance planners, or finance operations groups. It could standardise how findings are described, require recommendations to be supported by source artefacts, and reduce the amount of manual report writing required from scarce technical staff.

The ROI case is therefore not “LLMs discover the answer”. It is closer to:

analysts produce technical artefacts;
the multi-agent system converts them into structured explanations;
automated evaluation filters weak narratives;
humans intervene only on ambiguous or sensitive outputs;
managers receive a report that is easier to act on.

This matters because decision latency is a real operational cost. If fuel-efficiency insights sit in a chart pack until the monthly meeting, nothing has really been optimised. If the same insights can be turned into readable recommendations quickly and consistently, the organisation has a better chance of acting before the pattern becomes a budget line.

The framework still depends on the quality of the artefacts it reads

The paper is appropriately clear about several boundaries.

First, performance depends on the underlying LLM. GPT-4.1 mini performs well in the tested setting, but the authors note that lack of fine-tuning on energy-specific corpora may limit domain understanding when terminology, policy, or technical metrics become more specialised.

Second, visual interpretation depends heavily on chart quality. Ambiguous labels, poor encodings, or incomplete visual context can produce generic or inaccurate summaries. This is not a trivial weakness. Many enterprise dashboards are designed with the aesthetic discipline of a ransom note.

Third, the current architecture does not perform global consistency checks across narrative stages. A small error in an early description could propagate into later synthesis if the judge agent or human reviewer does not catch it. For practical deployment, this is one of the most important open issues. A reporting agent should not merely write fluently; it should preserve consistency across the report.
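
To show what even a crude global check might look like, the sketch below flags numbers quoted in a final report that appear in none of the earlier narratives. The paper does not implement this; the function name, the regular expression, and the example strings are invented for illustration, and a production check would need to handle units, rounding, and paraphrase.

```python
import re

# Matches integers and simple decimals such as 4006 or 12.5.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(report: str, narratives: list) -> list:
    """Return numbers quoted in the report that no stage narrative contains."""
    known = set()
    for text in narratives:
        known.update(NUMBER.findall(text))
    return [n for n in NUMBER.findall(report) if n not in known]


# Invented example strings; only the 4006 trip count comes from the case study.
narratives = [
    "The dataset contains 4006 trips.",
    "One cluster holds 12.5% of trips.",
]
report = "Across 4006 trips, the costliest cluster covers 15.2% of journeys."

flagged = unsupported_numbers(report, narratives)
```

Here the check flags 15.2, a figure the report states but no earlier stage produced, which is the kind of drift that would otherwise reach the reader unchallenged.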

Fourth, the system is validated on a single domain case. The authors argue that the architecture is domain-agnostic because the artefacts can be swapped for other operational metrics, such as manufacturing data or fault-diagnosis outputs. That is plausible, but not yet proven by the evidence presented. Generalisability remains a research agenda, not a trophy.

Finally, the system does not select the analysis itself. In Stage 3, for example, it relies entirely on post hoc artefacts provided by data scientists or domain experts. That is exactly as it should be. Let the modelling pipeline generate the analysis; let the narration pipeline make the analysis intelligible. Confusing those jobs is how organisations end up with confident nonsense wearing a dashboard badge.

What buyers and builders should take from the paper

For builders, the paper offers a useful template: do not ask one model to “summarise everything”. Split the work. Preserve background context. Process artefacts modularly. Evaluate outputs against explicit criteria. Escalate to humans selectively. Then synthesise.

For buyers, the lesson is to ask sharper questions before procuring AI reporting tools:

Question	Why it matters
What artefacts does the system read: raw data, charts, model outputs, or all of them?	Input design strongly affects hallucination risk
Are recommendations explicitly linked to source evidence?	Unsupported recommendations are operational liabilities
Is there a global consistency check across report sections?	Early errors can propagate into final reports
When does human review trigger?	Constant review kills scalability; no review kills trust
How are accuracy and informativeness measured?	Vague “quality” scores are not enough
What happens when chart labels are ambiguous?	Visual analytics fails quietly when inputs are poorly specified

The most mature use case is not replacing analysts. It is reducing the communication burden on analysts while giving managers more consistent reports. That may sound less cinematic than autonomous AI operations. It is also far more likely to survive contact with a procurement committee, an operations team, and the person who has to explain the fuel bill.

The ghosts are useful when they stay in the machine room

This paper is not a revolution in public transport optimisation. It is a practical contribution to a less glamorous but important problem: how to move from fragmented analytics to coherent decision support.

The mechanism-first view matters because the framework’s value sits in the orchestration. A narration agent describes. A judge agent evaluates. A human expert intervenes only when necessary. A reporting agent synthesises without creative embellishment. The case study shows that this arrangement can produce accurate chart narratives and data-supported recommendations at low cost, especially with GPT-4.1 mini and Chain-of-Thought prompting.

The business interpretation is therefore disciplined optimism. Multi-agent multimodal LLMs can help operational teams understand analytics faster and more consistently. They can turn chart packs into reports. They can make technical findings more accessible. They can even reduce the amount of interpretive glue-work that keeps data teams trapped in document production.

But they do not eliminate the need for good analytics, clean artefacts, expert oversight, or judgement. The ghosts in the machine are ghostwriters, not fleet managers. That is already useful enough.

Cognaptus: Automate the Present, Incubate the Future.

¹ Zhipeng Ma, Ali Rida Bahja, Andreas Burgdorf, André Pomp, Tobias Meisen, Bo Nørregaard Jørgensen, and Zheng Grace Ma, “Multi-Agent Multimodal Large Language Model Framework for Automated Interpretation of Fuel Efficiency Analytics in Public Transportation,” arXiv:2511.13476.
