Dirty Data, Clean Machines: How LLM Agents Rewire Predictive Maintenance

Workshop logs are not glamorous. They are where predictive-maintenance dreams go to meet misspelled component names, missing codes, wrong vehicle identifiers, and dates that imply a truck was both under repair and happily accumulating kilometres. Industrial AI, as ever, is less a matter of elegant algorithms than of persuading messy operational records to stop lying.

A recent paper from researchers at the University of Luxembourg tests whether large language model agents can help with that unromantic but decisive problem: cleaning automotive maintenance logs before they poison predictive-maintenance pipelines.¹ The answer is useful because it is uneven. LLM agents performed well on generic data-cleaning tasks: accepting clean records, rejecting out-of-fleet records, spotting digital system tests, correcting missing values, and repairing many invalid categorical fields. They largely failed where cleaning required domain-specific reasoning: temporal consistency and vehicle identifier alignment.

That split is the paper’s real contribution. Not “LLMs clean maintenance data now, please alert procurement.” More precisely: LLM agents can already automate parts of maintenance-log triage, but the economically valuable edge cases are exactly where current agents still need help. Convenient, of course. The boring part works first.

The scoreboard matters more than the architecture diagram

The study benchmarks six production LLMs in a controlled agentic environment. Each agent receives one maintenance record at a time and must choose exactly one action through a structured Log Cleaning API:

| Agent action | Meaning in the benchmark | Operational analogue |
| --- | --- | --- |
| `accept` | The record is clean | Let the record enter the PdM dataset |
| `reject` | The record is out-of-scope or irreparable | Quarantine test records, non-fleet records, or unusable entries |
| `update` | The record has one correctable field | Repair the record before ingestion |

The agents can query three reference sources: the fleet registry, the service catalogue, and odometer time-series data. They can list tables, inspect schemas, and run SQL queries. This is not a chatbot reading a CSV and improvising heroically. It is a tool-using agent placed inside a small enterprise-data environment and forced to act through a narrow interface. Good. Narrow interfaces are how industrial AI survives contact with auditors.
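To make the narrow interface concrete, here is a minimal sketch of how the three actions might be represented and validated before they touch the log. The class name `CleaningAction` and its fields are hypothetical; the paper's actual Log Cleaning API is not reproduced in this article, so treat this as an illustration of the one-action-per-record constraint rather than the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional

# The three actions named in the benchmark.
ALLOWED_ACTIONS = {"accept", "reject", "update"}


@dataclass(frozen=True)
class CleaningAction:
    """One decision for one maintenance record (hypothetical structure)."""
    action: str
    field_name: Optional[str] = None   # only meaningful for "update"
    new_value: Optional[str] = None    # only meaningful for "update"

    def __post_init__(self):
        if self.action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")
        has_fix = self.field_name is not None and self.new_value is not None
        if self.action == "update" and not has_fix:
            raise ValueError("update needs a field name and a new value")
        carries_fix = self.field_name is not None or self.new_value is not None
        if self.action != "update" and carries_fix:
            raise ValueError(f"{self.action} must not carry a correction")
```

Anything the agent emits that does not fit this shape is refused before it reaches the dataset, which is the whole point of a narrow interface.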

The paper’s main evidence is the performance table across six noise types plus a clean condition. The usage table is secondary but operationally important because it shows the cost–latency trade-off. The synthetic data generator and prompt appendix are implementation details. The repeated environments and small decoding-parameter variations act as a robustness check rather than a separate thesis.

| Evidence item | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Six-noise performance table | Main evidence | Which errors LLM agents can detect or repair | Production reliability on real fleet data |
| Usage metrics by model | Deployment economics | Relative cost, latency, and token burden | Total cost of ownership in a live system |
| Synthetic generator | Benchmark implementation | Reproducible, privacy-safe evaluation | Full realism of industrial maintenance workflows |
| Zero-shot prompt appendix | Implementation detail | The agents were lightly instructed, not example-trained | That better prompting or fine-tuning would be unnecessary |
| 31 independent environments with sampled decoding parameters | Robustness/sensitivity test | Results are not tied to one random seed or one decoding setting | Coverage of all real-world noise distributions |

The results are stark. GPT-5 correctly accepted 99.3% of clean records, flagged 98.7% of out-of-fleet vehicles and 99.7% of digital system test records, and reached 83.7% detection and correction on invalid categorical values and 100% on missing values. GPT-OSS-120B was close on several generic repair categories, with 81.3% detection and correction on invalid values and 99.3% on missing values.

Then the floor collapsed. For wrong end dates, every model recorded 0% correction. Vehicle identifier misalignment was only slightly less brutal: GPT-5 reached 27.7% detection and correction, while the other models stayed in single digits or zero.

That is not a footnote. That is the difference between cleaning spreadsheet errors and understanding a maintenance process.

Generic cleaning transfers; maintenance reasoning does not

The easy categories are easy for a reason. “Out-of-fleet vehicle” is mostly a membership check against a registry. “Digital system test” often has explicit semantic clues: test-like values, non-maintenance language, or fields that look intentionally artificial. “Missing value” and “invalid categorical value” can be resolved by consulting a service catalogue, matching likely categories, or repairing obvious vocabulary errors.

These are not trivial, but they are recognisable LLM territory. The model can use language, schema context, and table lookup. It can notice that “Brake Sysem” probably means “Brake System.” It can infer that a missing component belongs inside a known system–subsystem–activity pattern. It can reject a record that belongs to a valid-looking vehicle outside the monitored fleet. That is data curation as constrained semantic repair.
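The two easiest moves described above, a registry membership check and a vocabulary repair, can be sketched without any LLM at all. The function names, the example vocabulary, and the 0.8 similarity cutoff are assumptions made for illustration; the benchmark agents reach similar conclusions through SQL lookups and language priors, not through this code.

```python
import difflib


def is_in_fleet(plate, fleet_plates):
    """Membership check against the fleet registry (a set of plates)."""
    return plate in fleet_plates


def repair_category(value, vocabulary, cutoff=0.8):
    """Return the closest valid label, or None if nothing is close enough.

    Uses difflib's similarity ratio; the cutoff is an assumed threshold.
    """
    if value in vocabulary:
        return value
    matches = difflib.get_close_matches(value, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With a vocabulary such as `["Brake System", "Engine", "Transmission"]`, the typo "Brake Sysem" resolves to "Brake System", while a label that resembles nothing in the catalogue comes back as `None` and should go to review instead of being guessed.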

Wrong end dates are different. The agent must connect the maintenance interval to odometer behaviour. In the synthetic data, a vehicle under maintenance should log zero distance on full maintenance days and only half-distance on the boundary days. A wrong end date is not visible as a typo. It is visible only when the reported maintenance period disagrees with the time-series signal.

That is a more demanding form of reasoning: temporal, cross-table, and procedural. The agent must know what pattern to look for, retrieve the right signal, compare dates, and infer the corrected endpoint. Current LLM agents, in this setup, did not do it.
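The check the agents missed is easy to state as a deterministic rule. The sketch below assumes a clean mapping from calendar day to distance travelled, a trusted start date, and at least one full zero-distance day inside the repair interval; all names are hypothetical, and the logic is an illustration of the idea, not the paper's validator.

```python
from datetime import date, timedelta


def infer_end_date(daily_km, start):
    """First day with movement after a run of idle days following `start`.

    Returns None when no idle day is seen, because the signal is then
    ambiguous under this simple rule.
    """
    day = start + timedelta(days=1)
    saw_idle_day = False
    while day in daily_km:
        if daily_km[day] == 0:
            saw_idle_day = True
        elif saw_idle_day:
            return day          # boundary day: vehicle moves again
        else:
            return None         # vehicle never went idle
        day += timedelta(days=1)
    return None                 # ran out of odometer data


def check_end_date(daily_km, start, reported_end):
    """Map the comparison onto the benchmark's action vocabulary."""
    inferred = infer_end_date(daily_km, start)
    if inferred is None:
        return ("review", None)
    if inferred == reported_end:
        return ("accept", None)
    return ("update", inferred)
```

Real odometer data would need a noise tolerance in place of the exact zero comparison, and short repairs with no full idle day would need a different signal. The point is only that the pattern is procedural, which is why a rule finds it more reliably than a zero-shot prompt.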

Vehicle identifier misalignment is also harder than it looks. The noisy record may replace a licence plate with another identifier such as a device ID, display name, or VIN. A human data engineer would treat this as an entity-resolution task: identify which field is actually present, map it through the fleet registry, and restore the expected licence plate. The best model corrected only 27.7% of these cases. That is not enough for unsupervised master-data repair unless one enjoys explaining corrupted vehicle histories to a maintenance director.
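The entity-resolution step a data engineer would take can be sketched as a reverse lookup through the registry. The column names (`licence_plate`, `vin`, `device_id`, `display_name`) are assumed for illustration and may not match the benchmark's actual schema.

```python
# Registry columns that might have been written where the plate belongs
# (hypothetical names).
ID_FIELDS = ("licence_plate", "vin", "device_id", "display_name")


def resolve_plate(identifier, registry):
    """Return the licence plate that `identifier` maps to, or None.

    None covers both unknown identifiers and ambiguous ones that match
    more than one vehicle; both cases should go to review.
    """
    hits = {
        row["licence_plate"]
        for row in registry
        for name in ID_FIELDS
        if row.get(name) == identifier
    }
    return hits.pop() if len(hits) == 1 else None
```

Refusing to answer on ambiguous matches is deliberate: a wrong plate attaches the repair to the wrong vehicle, which is worse than leaving the record in a queue.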

The lesson is not that LLMs cannot reason. The lesson is narrower and more useful: general-purpose LLM agents are much better at schema-adjacent semantic cleanup than at domain-specific consistency checking that requires an inductive bias about how the physical system behaves.

The benchmark is synthetic, but not meaningless

The paper builds a synthetic automotive fleet environment because real maintenance data is hard to release. That is not an academic eccentricity; it is the normal reality of industrial AI. OEMs, workshops, and fleet operators are not lining up to publish records about reliability, repair practices, vehicle use, and internal process quality. Funny how proprietary data remains proprietary.

The generator creates four linked sources: a fleet registry, sensor data, a service operations catalogue, and a maintenance log. The clean fleet registry and odometer signals act as reference sources. The maintenance log is corrupted with controlled noise.

The six noise types are practical rather than ornamental:

| Noise type | What is corrupted | Expected action | Why it matters operationally |
| --- | --- | --- | --- |
| M1: vehicle identifier misalignment | Licence plate replaced by another identifier | Update | Breaks the link between repair history and vehicle telemetry |
| M2: out-of-fleet vehicle | Valid-looking external plate appears | Reject | Pollutes fleet-specific reliability analysis |
| M3: invalid values | Category typos or invalid service labels | Update | Damages component-level failure modelling |
| M4: missing values | One or more categorical fields blank | Update | Weakens root-cause and failure-mode analysis |
| M5: digital system test | IT/testing entry masquerades as maintenance | Reject | Confuses monitoring-system events with vehicle repairs |
| M6: wrong end date | Repair interval inconsistent with odometer signal | Update | Distorts downtime, exposure, and event timing |

This taxonomy is useful because it distinguishes two kinds of dirt. Some dirt is administrative: the wrong label, the missing component, the external vehicle, the test entry. Other dirt is operational: the record contradicts the behaviour of the machine. LLM agents handled much of the first kind. They failed badly on the second.

The benchmark also imposes simplifying assumptions. The fleet is homogeneous and static. Vehicles share operational profiles. Each vehicle has at most one maintenance event. Repairs happen at one central facility. The maintenance records are corrective repairs. The noise distribution is deliberately balanced: 30 records for each of six noise types and 30 clean records in each 210-record environment.

That balance is excellent for measuring per-noise performance. It is not how real maintenance logs behave. In production, test records may be rare, typos may cluster by workshop or technician, missing values may correlate with process pressure, and identifier problems may arise from system migrations rather than random perturbation. The synthetic setup gives control, not external validity by magic.
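The balanced design is simple enough to write down. The sketch below only builds the label mix of one environment (30 records for each of the six noise types plus 30 clean ones) and shuffles it; the function name and seed handling are assumptions, and the paper's generator does far more than this.

```python
import random

NOISE_TYPES = ["M1", "M2", "M3", "M4", "M5", "M6"]


def build_environment_labels(per_class=30, seed=0):
    """Return a shuffled list of condition labels for one environment."""
    labels = [
        label
        for label in NOISE_TYPES + ["clean"]
        for _ in range(per_class)
    ]
    random.Random(seed).shuffle(labels)  # seeded for reproducibility
    return labels
```

Seven conditions at 30 records each gives the 210-record environment. A production-like mix would replace the uniform `per_class` with skewed, correlated frequencies, which is exactly the realism the benchmark trades away for control.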

The cost table changes the deployment question

The benchmark includes six models: Nemotron-Nano-9B-v2, GPT-OSS-20B, GPT-OSS-120B, Qwen3-Next-80B-A3B-Instruct, Kimi-K2-0905, and GPT-5. GPT-5 performs best in several categories, but it is also the most expensive and slowest configuration in the reported usage metrics: about $5.86 per experiment and 11,051 seconds.

GPT-OSS-120B is the more interesting business result. It delivered competitive performance on generic repairs at roughly $0.18 per experiment, with about 1.86 million input tokens, 169,000 output tokens, and 3,908 seconds of runtime. GPT-OSS-20B and Nemotron sat in the low-cost range, around $0.08 and $0.09 respectively, but their repair quality was weaker, especially for fine-grained corrections.
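The gap is easier to feel as ratios. This small calculation uses only the per-experiment figures quoted above; it makes no assumption about how many records an experiment covers, so it says nothing about cost per record.

```python
# Reported usage per experiment, as quoted in this article.
usage = {
    "GPT-5": {"cost_usd": 5.86, "seconds": 11051},
    "GPT-OSS-120B": {"cost_usd": 0.18, "seconds": 3908},
}

cost_ratio = usage["GPT-5"]["cost_usd"] / usage["GPT-OSS-120B"]["cost_usd"]
time_ratio = usage["GPT-5"]["seconds"] / usage["GPT-OSS-120B"]["seconds"]

print(f"GPT-5 cost vs GPT-OSS-120B:    {cost_ratio:.1f}x")
print(f"GPT-5 runtime vs GPT-OSS-120B: {time_ratio:.1f}x")
```

On these figures GPT-5 costs roughly 33 times as much and takes just under three times as long, for a few percentage points on the generic repair categories.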

This matters because maintenance-log cleaning is not a single heroic inference. It is a repeated operational process. A fleet operator does not need one perfect answer for a demo. It needs thousands of acceptable decisions at a cost low enough to justify continuous ingestion.

The paper therefore points toward tiered deployment rather than one-model supremacy. A cheap model may be good enough to accept clearly clean records and reject obvious out-of-scope entries. A stronger model may handle uncertain categorical repairs. Temporal and identifier anomalies should probably go to deterministic validators, specialised models, or human review until the agentic layer improves.
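A tiered setup of this kind can be sketched as a small routing table. The tier names, the issue labels, and the 0.9 confidence threshold are all hypothetical; the paper does not prescribe a router, and how a record gets its first-pass label is left open here.

```python
# Suspected issue -> first handler (illustrative mapping only).
ROUTES = {
    "clean": "cheap_model",
    "out_of_fleet": "cheap_model",
    "system_test": "cheap_model",
    "invalid_value": "strong_model",
    "missing_value": "strong_model",
    "identifier": "deterministic_validator",
    "temporal": "deterministic_validator",
}


def route(suspected_issue, confidence, threshold=0.9):
    """Pick a handler; escalate low-confidence cheap-tier cases."""
    tier = ROUTES.get(suspected_issue, "human_review")
    if tier == "cheap_model" and confidence < threshold:
        return "strong_model"
    return tier
```

Unknown issue types fall through to human review by default, so the router fails towards caution instead of towards the cheapest model.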

That is less glamorous than “GPT-5 fixes maintenance.” It is also the version that might survive a budget meeting.

The business value is triage before prediction

Predictive maintenance projects often talk about model accuracy as if the model were the main bottleneck. In many industrial settings, the bottleneck appears much earlier: in the labels, logs, and event histories that tell the model what actually happened.

A bad maintenance log can distort three things at once. It can corrupt the target label for failure prediction. It can misalign the event date used to build time-to-failure windows. It can attach the repair to the wrong vehicle, which is a tidy way of making both vehicles analytically useless. The model downstream may be blamed for poor performance, but the crime scene is upstream.

This paper’s practical implication is not that LLM agents replace data engineers. It is that agents can become a stream-processing triage layer between raw workshop input and the analytical warehouse.

Cognaptus would read the deployment pattern like this:

| Pipeline layer | What the paper directly supports | What should be added before production |
| --- | --- | --- |
| Clean-record acceptance | High detection rates across models on noise-free records | Confidence thresholds and audit sampling |
| Obvious rejection | Strong performance on out-of-fleet and many test records | Policy rules for quarantine versus deletion |
| Simple semantic repair | Strong large-model results on missing and invalid categorical fields | Controlled vocabularies, repair provenance, human override |
| Identifier repair | Weak current performance | Master-data services, entity-resolution rules, review queue |
| Temporal repair | Current agents failed to correct wrong end dates | Temporal-logic validators, odometer consistency checks, deterministic pre/post-validation |
| Continuous ingestion | The setup simulates one-record-at-a-time processing | Integration with live systems, latency targets, rollback mechanisms |

The useful architecture is hybrid. Let rules do what rules are good at: checking hard constraints, validating dates, enforcing identifier formats, and flagging impossible event sequences. Let LLM agents do what they are better at: interpreting messy descriptions, reconciling category labels, proposing likely repairs, and explaining why a record should be quarantined.
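One way to picture the hybrid is rules first, agent second. In the sketch below the plate format, the record field names, and the `llm_step` callable are all placeholders; a real deployment would plug in its own constraints and its own agent call.

```python
import re
from datetime import date

# Hypothetical plate format: two capital letters followed by four digits.
PLATE_PATTERN = re.compile(r"[A-Z]{2}\d{4}")


def hard_constraint_violations(record):
    """Deterministic checks that need no language understanding."""
    problems = []
    if not PLATE_PATTERN.fullmatch(record["licence_plate"]):
        problems.append("identifier_format")
    if record["end_date"] < record["start_date"]:
        problems.append("end_before_start")
    return problems


def triage(record, llm_step):
    """Run rules first; only rule-clean records reach the agent."""
    problems = hard_constraint_violations(record)
    if problems:
        return {"handled_by": "rules", "problems": problems}
    return {"handled_by": "llm", "proposal": llm_step(record)}
```

The ordering matters: hard constraints are cheap and auditable, so they filter before any tokens are spent, and the agent only sees records where interpretation is actually needed.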

The paper’s own future-work direction aligns with this: temporal-logic validators, hybrid rule–LLM architectures, domain fine-tuning, richer synthetic noise, persistent memory, and evaluation on anonymised real logs. That is the right direction. It also means the near-term product is not an autonomous “maintenance data cleaner.” It is a controlled curation assistant with sharp edges padded.

Where the result applies, and where it does not

This study gives a credible early benchmark for LLM agents in predictive-maintenance data cleaning. It does not give a production guarantee.

The most important boundary is synthetic data. The noise categories were informed by real-world maintenance-log patterns, but the experiments were conducted on generated environments. That choice improves reproducibility and privacy, but it limits claims about messy operational systems. Real logs contain multi-field corruptions, partial migrations, duplicated work orders, inconsistent technician practices, undocumented interventions, and ambiguous semantics that do not politely fit into one of six balanced categories.

The second boundary is task design. Each record required exactly one of three actions, and updates were single-field corrections. That is clean for benchmarking. Production records are less courteous. One row may have a wrong date, a missing component, a stale identifier, and a work description that contradicts all of them. A real cleaner may need multiple edits, escalation, or a confidence-based partial repair.

The third boundary is reference quality. In the benchmark, the fleet registry and odometer data are clean. In live environments, the reference sources may themselves contain errors. A model that can query a database is only as useful as the database it is querying. Yes, even the magic agent must occasionally live in the same swamp as everyone else.

The fourth boundary is economics. The reported cost and time are useful for comparing models under the study’s setup, not for calculating a full production business case. Real deployment would include orchestration, logging, security, human review, data integration, model monitoring, and failure handling. Token cost is the visible part of the invoice.

The right lesson is not autonomy. It is allocation.

The lazy interpretation is that LLM agents are now ready to clean predictive-maintenance logs. The equally lazy counter-interpretation is that the failure on wrong end dates proves the approach is not ready. Both miss the point.

The evidence supports a more disciplined allocation of work. Use LLM agents for high-volume, language-heavy, schema-adjacent cleaning. Use deterministic tools for hard temporal and entity constraints. Use human review for cases where a wrong repair would corrupt maintenance history or financial accountability. Track all changes as data lineage, not vibes with timestamps.
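Tracking changes as lineage can be as plain as never overwriting a value without writing down who changed it and why. The structure below is a hypothetical minimal record, assuming each maintenance record is a dictionary with an `id` key; it is not drawn from the paper.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageEntry:
    """Audit trail for one field-level change (illustrative fields)."""
    record_id: str
    field_name: str
    old_value: object
    new_value: object
    actor: str        # e.g. "rules", an agent/model name, or a reviewer
    reason: str
    changed_at: datetime


def apply_update(record, field_name, new_value, actor, reason):
    """Return an updated copy plus its lineage entry; never mutate."""
    entry = LineageEntry(
        record_id=record["id"],
        field_name=field_name,
        old_value=record.get(field_name),
        new_value=new_value,
        actor=actor,
        reason=reason,
        changed_at=datetime.now(timezone.utc),
    )
    updated = {**record, field_name: new_value}
    return updated, entry
```

Because the original record is left untouched and every change carries an actor and a reason, a bad repair can be traced and rolled back instead of silently becoming the vehicle's history.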

That allocation can still be valuable. Even if agents only automate clean acceptance, obvious rejection, and simple categorical repair, they can reduce the backlog of manual preprocessing and move maintenance projects closer to continuous data readiness. The time saved is not just clerical. Cleaner logs improve failure labels, downtime estimation, component-level analysis, and the trustworthiness of downstream PdM models.

The deeper implication is that predictive maintenance may need fewer grand AI promises and more intelligent plumbing. Sensors, models, and dashboards all matter. But if the repair log says the wrong vehicle was fixed on the wrong date for the wrong component, no amount of neural sophistication will save the forecast. It will simply be wrong with excellent GPU utilisation.

LLM agents are beginning to make industrial data cleaning more adaptive. This paper shows where that adaptation works first: generic curation under constrained tools. It also shows where the next frontier sits: domain-aware validation that understands machines as processes over time, not just rows in a table.

Clean machines, it turns out, still begin with clean records. The agents can help. They just should not be left alone with the calendar yet.

Cognaptus: Automate the Present, Incubate the Future.

¹ Valeriu Dimidov, Faisal Hawlader, Sasan Jafarnejad, and Raphaël Frank, “Cleaning Maintenance Logs with LLM Agents for Improved Predictive Maintenance,” arXiv:2511.05311, 2025. https://arxiv.org/abs/2511.05311
