TL;DR for operators
Bridge teams do not usually lack data. They lack enough expert time to turn dense inspection data into clear, defensible decisions. That is the operational gap this paper tries to narrow: not by replacing bridge engineers with a chatbot in a hard hat, thankfully, but by using multimodal LLMs to translate non-destructive evaluation contour maps into structured condition assessments and maintenance recommendations [1].
The paper proposes a three-stage workflow. First, five NDE contour maps from one bridge are prepared as inputs: two Ground Penetrating Radar views, Electrical Resistivity, Impact-Echo, and Ultrasonic Surface Waves. Second, multiple image-captioning models interpret the maps in parallel. Third, the stronger interpretations are fed into summarisation models that consolidate them into a bridge-condition report.
The strongest captioning models in the pilot were Claude 3.5 Sonnet, ChatGPT-4, CogVLM2, and ShareGPT4V. Claude 3.5 Sonnet, ChatGPT-4, and CogVLM2 received the top overall image-captioning rating of 5, while ShareGPT4V received 4 because it lacked specificity. For summarisation, ChatGPT-4 scored 5.00 and Claude 3.5 Sonnet scored 4.67, ahead of Mistral, Gemini, and Llama3.
For operators, the immediate business value is not autonomous bridge certification. It is faster first-pass triage, more consistent report drafting, and a better interface between technical NDE specialists and budget-holding decision-makers. The dull but essential phrase is “human-in-the-loop.” The study itself practically begs for it.
The main boundary is scale. The experiment uses one FHWA bridge case in Mississippi, five contour maps, qualitative ratings assigned by the researchers and one domain expert, and prompt-optimised model outputs. That is enough to show a plausible workflow. It is not enough to hand an LLM a capital maintenance budget and tell it to enjoy itself.
The machine does not inspect the bridge; it organises the evidence
A bridge inspection file can contain several kinds of truth at once. One map hints at cover depth. Another suggests moisture or deterioration. Another points toward corrosion risk. Another says something about delamination. Another approximates concrete stiffness. Individually, each is a technical signal. Together, they become an argument about where the bridge may be weakening.
That “together” is the expensive part.
The paper’s useful move is not simply asking a model to caption an image. Captioning is the visible surface. The deeper mechanism is workflow orchestration: feed several technical contour maps into multiple multimodal models, compare their descriptive outputs, then use a second LLM layer to consolidate the interpretations into a condition summary.
A simplified version looks like this:
| Stage | Input | Model role | Output |
|---|---|---|---|
| Data preparation | Five NDE contour maps | Standardise visual inputs and prompt context | A common inspection task for each model |
| Parallel captioning | GPR, ER, IE, USW maps | Describe condition signals, likely defects, and poor-condition areas | Multiple model interpretations |
| Model filtering | Caption outputs | Select stronger interpretations based on relevance, usefulness, coverage, specificity | Shortlist of high-quality captions |
| Summarisation | Top caption outputs | Consolidate findings into one report | Condition summary and recommendations |
| Human review | Generated report | Validate against engineering judgement and field evidence | Inspection support, not final authority |
This is why a mechanism-first reading matters. A leaderboard summary would say “ChatGPT-4 and Claude 3.5 Sonnet did well.” True, but incomplete. The more important claim is that LLMs may become a connective layer inside infrastructure inspection workflows, where the bottleneck is not only seeing the signal but turning heterogeneous signals into a usable maintenance narrative.
That distinction matters because infrastructure agencies do not procure “interesting model behaviour.” They procure workflows that shorten cycle time, reduce ambiguity, and survive audit. The bridge does not care whether the summary sounds elegant. The procurement officer, engineer, and liability insurer rather do.
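The orchestration in the table above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `Captioner`, `Summariser`, `DraftReport`, `run_workflow`, and the `min_rating` threshold are all hypothetical names, and the model calls are left as plain callables so that any vendor API could sit behind them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical signatures: a captioner takes (map name, prompt) and returns text;
# a summariser takes a list of captions and returns one draft report.
Captioner = Callable[[str, str], str]
Summariser = Callable[[List[str]], str]

# The five contour maps named in the article.
NDE_MAPS = [
    "GPR cover depth",
    "GPR DCA",
    "Electrical Resistivity",
    "Impact-Echo",
    "Ultrasonic Surface Waves",
]


@dataclass
class DraftReport:
    text: str
    sources: List[str]      # captioning models whose output fed the summary
    approved: bool = False  # only a human reviewer should flip this


def run_workflow(captioners: Dict[str, Captioner],
                 ratings: Dict[str, int],
                 summariser: Summariser,
                 prompt: str,
                 min_rating: int = 4) -> DraftReport:
    # Parallel captioning: every model describes every map.
    captions = {name: [fn(nde_map, prompt) for nde_map in NDE_MAPS]
                for name, fn in captioners.items()}
    # Model filtering: keep models whose human-assigned rating clears the bar.
    kept = sorted(name for name in captions if ratings.get(name, 0) >= min_rating)
    # Summarisation: consolidate the shortlisted captions into one draft.
    merged = [caption for name in kept for caption in captions[name]]
    return DraftReport(text=summariser(merged), sources=kept)
```

The design choice worth noticing is that the report starts life unapproved. The human review row of the table is a field in the data structure, not a footnote in the user manual.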
Five contour maps are five different languages of deterioration
The study uses five contour maps from one bridge in the Federal Highway Administration bridge database. The selected bridge is in Mississippi, with Structure Number 11002200250051B and LTBP Bridge Number 28-000008. The authors note that, as of July 23, 2024, 38 bridges in the database included NDE data, but this pilot examines only one bridge.
The five maps are not redundant images of the same thing. They represent different measurement technologies and different physical interpretations:
| NDE map | What it measures in the paper | What it can contribute to assessment |
|---|---|---|
| GPR cover depth | Cover depth in inches | Whether reinforcing steel has enough concrete cover for protection |
| GPR DCA | Depth-Corrected Attenuation at top rebar level, in decibels | Possible deterioration, moisture-related effects, or internal condition trends |
| Electrical Resistivity | Resistivity patterns | Concrete properties and corrosion activity risk |
| Impact-Echo | Frequency in kHz | Concrete integrity and possible delamination |
| Ultrasonic Surface Waves | Modulus in ksi | Concrete mechanical properties and stiffness variation |
This is where multimodal LLMs become interesting, and also dangerous. They are not merely reading a photograph of a crack. They are interpreting technical visualisations that encode physical measurements. The colours and spatial patterns are not decoration. They are the data.
A conventional image-captioning model can get away with saying “a dog sitting on grass.” An infrastructure model must do something more awkward: infer what a low-frequency patch, a resistivity anomaly, or a lower-modulus region might mean for a specific part of a bridge deck. That is less like describing an image and more like narrating an instrument panel.
The paper’s prompt reflects this. It tells the model to act as a structural engineer, notes that the x- and y-axes are in feet, and asks it to describe bridge condition, determine whether repairs are needed, and identify poor-condition areas. That role assignment is an implementation detail with strategic importance. The model is being steered away from generic captioning and toward professional interpretation.
This also creates the first bias. When a prompt asks a model to determine whether repairs are needed and identify areas in poor condition, the model has a nudge toward finding actionable defects. That may be useful in triage. It is not neutral measurement. A well-run deployment would need prompts that separate observation, inference, confidence, and recommendation instead of letting them arrive as one fluent paragraph wearing a reflective vest.
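One way to keep observation, inference, confidence, and recommendation from arriving as a single fluent paragraph is to make them separate fields. The sketch below is an assumption about what such a record could look like; the `Finding` class, its field names, and the three confidence levels are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical confidence scale; a real deployment would define its own.
CONFIDENCE_LEVELS = ("low", "medium", "high")


@dataclass(frozen=True)
class Finding:
    modality: str                    # e.g. "Impact-Echo"
    region_ft: Tuple[float, float]   # (start, end) along the deck, in feet
    observation: str                 # what is visible on the map
    inference: str                   # what it might mean physically
    confidence: str                  # one of CONFIDENCE_LEVELS
    recommendation: Optional[str] = None  # allowed to stay empty

    def __post_init__(self) -> None:
        # Reject free-text confidence so it cannot hide inside the prose.
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"confidence must be one of {CONFIDENCE_LEVELS}")
        start, end = self.region_ft
        if not 0 <= start < end:
            raise ValueError("region_ft must satisfy 0 <= start < end")
```

Leaving `recommendation` optional is deliberate: a finding that supports only further investigation should be able to say nothing about repairs.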
The experiment tests a workflow, not an oracle
The evidence in the paper has three main parts.
First, the authors test nine image-captioning models on the NDE maps. Second, they evaluate the outputs using four qualitative criteria: relevance, usefulness, coverage, and specificity. Third, they pass the top captioning outputs into five summarisation models and evaluate the resulting summaries using completeness, in-depth coverage, and formatting/presentation.
The captioning comparison is the paper’s main evidence for model selection:
| Image-captioning model | Relevance | Usefulness | Coverage | Specificity | Overall rating |
|---|---|---|---|---|---|
| Claude 3.5 Sonnet | Yes | Yes | Yes | Yes | 5 |
| ChatGPT-4 | Yes | Yes | Yes | Yes | 5 |
| CogVLM2 | Yes | Yes | Yes | Yes | 5 |
| ShareGPT4V | Yes | Yes | Yes | No | 4 |
| Florence-2 | Yes | No | Yes | No | 3 |
| Paligemma FT | Yes | No | Yes | No | 3 |
| Paligemma | Yes | No | No | No | 2 |
| BLIP large | No | No | No | No | 1 |
| vit-gpt2 | No | No | No | No | 1 |
The obvious reading is that larger, more capable multimodal systems did better. The more useful reading is that the task rewards models that can combine visual perception, domain framing, and structured explanation.
The weaker captioning models did not merely produce less poetic prose. They failed on usefulness, coverage, and specificity. In infrastructure reporting, those are not cosmetic defects. A vague but fluent output is dangerous because it looks like analysis while doing almost none of the work. This is the classic enterprise AI trap: the dashboard looks calm, therefore someone assumes the bridge is too.
The evaluation design also deserves careful handling. The scoring rubric is qualitative, built for this pilot, and assessed by the researchers and one domain expert. That makes it useful for exploratory comparison, but not equivalent to a statistically validated benchmark. There is no large labelled dataset, no prospective field deployment, and no ground-truth confusion matrix showing false positives, false negatives, or localisation error rates.
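The overall ratings in the captioning table happen to fit a very simple pattern: one point, plus one for each criterion met. That is an observation read off the table, not a scoring rule this article can attribute to the authors, so treat the `overall_rating` function below as a hypothetical reconstruction.

```python
# Criteria order matches the table columns in the article.
CRITERIA = ("relevance", "usefulness", "coverage", "specificity")

# (criterion flags, overall rating) copied from the captioning table.
CAPTION_TABLE = {
    "Claude 3.5 Sonnet": ((True, True, True, True), 5),
    "ChatGPT-4": ((True, True, True, True), 5),
    "CogVLM2": ((True, True, True, True), 5),
    "ShareGPT4V": ((True, True, True, False), 4),
    "Florence-2": ((True, False, True, False), 3),
    "Paligemma FT": ((True, False, True, False), 3),
    "Paligemma": ((True, False, False, False), 2),
    "BLIP large": ((False, False, False, False), 1),
    "vit-gpt2": ((False, False, False, False), 1),
}


def overall_rating(flags):
    # Assumed rule: base score of 1, plus 1 per criterion marked "Yes".
    return 1 + sum(1 for met in flags if met)


# Models whose reported rating does not match the assumed rule.
mismatches = [name for name, (flags, reported) in CAPTION_TABLE.items()
              if overall_rating(flags) != reported]
```

If the pattern holds, the rating carries no information beyond the four Yes/No columns, which is one more reason to read it as a coarse pilot rubric and not a calibrated metric.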
So the result is not “LLMs can inspect bridges.” The result is narrower and better: some multimodal LLMs can produce useful structured interpretations of NDE contour maps in a controlled pilot, especially when prompted as technical analysts and followed by summarisation.
That is still a meaningful result. It is just not the same result the keynote version would put on a slide with a picture of a robot dog.
Prompting is part of the system, not seasoning sprinkled after the meal
The paper states that prompt optimisation played an important role. The vit-gpt2 model was used without prompts. BLIP large was tested with and without prompts. The other models went through more extensive prompt optimisation, moving from a simple description request to a role-based professional inspection prompt.
This matters because model capability and prompt design are partly entangled. If a model performs well after careful prompting, the result belongs to the system: model plus prompt plus input formatting plus evaluation criteria. It does not belong to raw model architecture alone.
For business users, that is not a problem. Enterprises buy systems, not model purity. But it changes how the result should be interpreted.
A production bridge-inspection assistant would need a prompt library, not a single clever prompt. It would need templates for different bridge types, map scales, NDE modalities, deterioration hypotheses, confidence levels, and reporting standards. It would also need negative prompts: instructions not to infer beyond the map, not to collapse all anomalies into repair recommendations, and not to invent urgency where the evidence only supports further investigation.
The interesting deployment lesson is therefore not “use the best model.” It is “control the interpretation contract.” The prompt is that contract. It tells the model whether it is an observer, analyst, drafter, decision-support assistant, or overconfident intern with a thesaurus.
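A prompt library with explicit guardrails might look something like the sketch below. The wording of `ROLE`, `GUARDRAILS`, and `MODALITY_NOTES` is illustrative and is not the prompt used in the paper; only the role framing, the axes in feet, and the per-map units are taken from the description above. `build_prompt` is a hypothetical helper.

```python
# Illustrative wording only; not the paper's actual prompt text.
ROLE = "Act as a structural engineer reviewing an NDE contour map of a bridge deck."

# Negative instructions: things the model is told not to do.
GUARDRAILS = (
    "Do not infer beyond what the map shows.",
    "Do not turn every anomaly into a repair recommendation.",
    "If the evidence only supports further investigation, say so.",
)

# Units follow the article's NDE table; resistivity units are not stated there.
MODALITY_NOTES = {
    "GPR cover depth": "Values are cover depth in inches.",
    "GPR DCA": "Values are depth-corrected attenuation at top rebar level, in decibels.",
    "Electrical Resistivity": "Values are electrical resistivity.",
    "Impact-Echo": "Values are frequency in kHz.",
    "Ultrasonic Surface Waves": "Values are modulus in ksi.",
}


def build_prompt(modality: str) -> str:
    # Fail loudly on an unknown map type instead of sending a generic prompt.
    if modality not in MODALITY_NOTES:
        raise KeyError(f"no prompt template for modality: {modality}")
    lines = [
        ROLE,
        "The x- and y-axes are in feet.",
        MODALITY_NOTES[modality],
        "Report observation, inference, confidence, and recommendation as separate fields.",
    ]
    lines.extend(GUARDRAILS)
    return "\n".join(lines)
```

Keeping the guardrails in one tuple means they are versioned and auditable alongside the templates, which matters once someone asks which contract a given report was written under.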
Summarisation is where the workflow becomes operational
The second model layer is easy to treat as a footnote. It is not. In real inspection workflows, summary generation is often where technical evidence becomes management action.
The paper feeds outputs from the four strongest captioning models (Claude 3.5 Sonnet, ChatGPT-4, CogVLM2, and ShareGPT4V) into five summarisation models. The summarisers are then evaluated for completeness, in-depth coverage, and formatting/presentation.
| Summarisation model | Completeness | In-depth coverage | Formatting and presentation | Overall rating |
|---|---|---|---|---|
| ChatGPT-4 | 5 | 5 | 5 | 5.00 |
| Claude 3.5 Sonnet | 5 | 4 | 5 | 4.67 |
| Mistral | 4 | 3 | 4 | 3.67 |
| Gemini | 3 | 3 | 4 | 3.33 |
| Llama3 | 3 | 3 | 4 | 3.33 |
ChatGPT-4 and Claude 3.5 Sonnet lead. The business implication is straightforward: the summarisation layer is not merely compressing text. It is doing synthesis, prioritisation, and report formatting. That is exactly the sort of work that consumes senior engineering attention when agencies must translate inspection evidence into maintenance planning.
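The overall ratings in the summarisation table are consistent with an unweighted mean of the three criterion scores, rounded to two decimals. That reading is inferred from the numbers, so the `overall` function below is an assumption about the arithmetic, not a quoted formula.

```python
# (completeness, in-depth coverage, formatting/presentation) from the table.
SCORES = {
    "ChatGPT-4": (5, 5, 5),
    "Claude 3.5 Sonnet": (5, 4, 5),
    "Mistral": (4, 3, 4),
    "Gemini": (3, 3, 4),
    "Llama3": (3, 3, 4),
}

# Overall ratings as reported in the table.
REPORTED = {
    "ChatGPT-4": 5.00,
    "Claude 3.5 Sonnet": 4.67,
    "Mistral": 3.67,
    "Gemini": 3.33,
    "Llama3": 3.33,
}


def overall(scores):
    # Assumed rule: unweighted mean of the criteria, rounded to 2 decimals.
    return round(sum(scores) / len(scores), 2)


computed = {model: overall(scores) for model, scores in SCORES.items()}
```

An unweighted mean treats formatting as equal in value to completeness. An agency adopting this rubric might reasonably decide that a tidy report with a missing deck region should not score that well.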
The appendix examples show the potential. A Claude 3.5 Sonnet caption identifies areas such as 0–50 ft, 80–100 ft, and 220–250 ft as concerning, associating them with low cover depth, possible delamination, reduced concrete quality, corrosion risk, or higher attenuation. A ChatGPT-4 summarisation example consolidates findings into a report, highlighting regions such as 50–80 ft, 160–200 ft, and several resistivity-related areas, then recommending detailed inspection, repairs, protective overlays, corrosion mitigation, and regular monitoring.
That is useful. It is also revealing.
The appendix outputs do not perfectly align on every region. One model’s demonstration emphasises 80–100 ft and 220–250 ft; another summary emphasises 50–80 ft and 160–200 ft, with additional resistivity zones. The paper itself notes periodic inconsistencies and the need for expert validation. This is not an embarrassment. It is the point. The system should surface disagreements, not smooth them into a single confident paragraph.
A good deployment would preserve model disagreement as a diagnostic signal. If three models flag one deck segment and one model flags another, the answer is not to average them into false consensus. The answer is to show the discrepancy to the engineer and ask whether the maps, colour scales, input ordering, or modality interpretations explain it. Consensus is useful. Disagreement is often more useful, because bridges rarely fail out of respect for clean formatting.
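Preserving disagreement can be as simple as refusing to merge flagged regions before anyone has looked at them. The sketch below bins the deck into fixed segments and records which models flagged each one. `segment_votes`, `disagreements`, the 10 ft step, and the deck length passed in are all hypothetical choices for illustration.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # (start_ft, end_ft) of a flagged region


def segment_votes(flags: Dict[str, List[Range]],
                  deck_len_ft: int,
                  step_ft: int = 10) -> List[Tuple[int, int, List[str]]]:
    """For each deck segment, list the models whose flagged ranges overlap it."""
    rows = []
    for start in range(0, deck_len_ft, step_ft):
        end = min(start + step_ft, deck_len_ft)
        # Half-open overlap test: [a, b) intersects [start, end).
        voters = sorted(model for model, ranges in flags.items()
                        if any(a < end and start < b for a, b in ranges))
        rows.append((start, end, voters))
    return rows


def disagreements(rows, n_models: int):
    """Segments flagged by at least one model but not by all of them."""
    return [row for row in rows if 0 < len(row[2]) < n_models]
```

Feeding in the two sets of regions quoted from the appendix (0–50, 80–100, 220–250 ft for one output; 50–80, 160–200 ft for the other) over an assumed 250 ft deck would mark every flagged segment as a disagreement, since the two sets never overlap. That is exactly the table an engineer should see before the summary is written.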
What the paper directly shows, and what operators may infer
The paper’s direct evidence and the business interpretation should be kept separate. Otherwise this becomes marketing, and marketing has a poor record as a load-bearing material.
| Claim | What the paper directly shows | Cognaptus interpretation | Boundary |
|---|---|---|---|
| LLMs can interpret NDE contour maps | Several multimodal models produced high-rated descriptions of five maps from one bridge | Multimodal LLMs can act as first-pass interpreters for technical visualisations | One-bridge pilot; no large benchmark or prospective validation |
| Parallel captioning improves workflow structure | Multiple models generated interpretations before summarisation | Model ensembles can support cross-checking and disagreement discovery | The study does not quantify ensemble reliability statistically |
| Summarisation creates actionable reports | ChatGPT-4 and Claude produced the strongest summaries under the pilot rubric | Report drafting may be a near-term productivity use case | Generated recommendations require engineer review |
| Prompt design matters | Prompt optimisation improved output quality for capable models | Prompt libraries and task contracts will be central to deployment | Prompt performance may not generalise across agencies, map styles, or bridge types |
| LLMs can support bridge maintenance decisions | Outputs included condition summaries and repair suggestions | Agencies could use LLMs for triage, documentation, and prioritisation support | Not a substitute for inspection standards, legal accountability, or field judgement |
The strongest practical use case is not “automated inspection.” It is “inspection intelligence workbench.”
A transportation department could use such a system after NDE data collection to generate an initial narrative: where the maps agree, where they diverge, which deck regions deserve follow-up, what additional tests may be justified, and how to communicate the findings to non-specialist decision-makers. The engineer remains responsible. The model reduces blank-page time, not professional liability.
Contractors could use a similar workflow to prepare preliminary repair scopes, estimate investigation priorities, or standardise internal reporting across projects. Asset owners could use it to make technical evidence more readable for capital planning. Insurers and auditors might eventually care about the audit trail: which maps were reviewed, which prompts were used, which model versions generated the report, and what the human reviewer accepted or rejected.
None of this requires pretending the model is an inspector. In fact, pretending that would make the product worse.
The bias problem is subtler than hallucination
The title of this article says “biases,” but the interesting bias here is not only hallucination. It is interpretive framing.
A model prompted as a structural engineer may produce confident engineering-style language. That can be good when the output is accurate, disciplined, and tied to visible evidence. It can be bad when the model converts uncertain map patterns into a polished maintenance narrative without preserving uncertainty.
There are at least four practical biases to manage:
| Bias | How it appears in this setting | Operational control |
|---|---|---|
| Repair-seeking bias | The prompt asks the model to determine whether repairs are needed, nudging it toward action recommendations | Separate observation, diagnosis, severity, and recommendation fields |
| Fluency bias | A well-written report may feel more reliable than it is | Require evidence references to map regions and modalities |
| Consensus bias | Summarisation may smooth over disagreements between captioning models | Preserve disagreement tables before final synthesis |
| Modality bias | The model may over-weight visually salient colour patterns without understanding measurement physics | Require modality-specific interpretation rules and expert review |
This is why “human-in-the-loop” should not mean a human rubber-stamping a beautiful PDF. It means the system should make human review easier by showing evidence, uncertainty, disagreement, and provenance. The reviewer should be able to ask: Which map supports this claim? Which model said it? Did another model disagree? Was the claim observational, inferential, or a recommendation?
Without that structure, the LLM becomes a prose machine sitting between the engineer and the evidence. That is a bad place to put a prose machine. Prose machines are charming. Charm is not calibration.
The appendix examples are demonstrations, not a second thesis
The paper’s appendix serves three likely purposes.
First, the contour-map figures document the input data. This is implementation context, not an independent validation set. The figures show what the models were asked to interpret.
Second, the prompt table is an implementation detail with practical importance. It reveals the role framing used to elicit engineering-style analysis. It also shows why prompt sensitivity must be treated as part of the system design.
Third, the Claude captioning and ChatGPT-4 summarisation examples are qualitative demonstrations. They show the kind of output the workflow can produce: region-specific observations, interpretation of possible defects, and recommendations for inspection, repair, protective overlays, corrosion mitigation, and monitoring.
What they do not prove is equally important. They do not prove that the identified segments are correct against independent ground truth. They do not prove the models would behave consistently across bridge types, map resolutions, missing legends, poor scans, unusual deterioration patterns, or adversarially confusing visualisations. They do not quantify whether a junior engineer plus LLM outperforms a senior engineer without LLM. They do not estimate time saved.
That does not weaken the paper as a pilot. It clarifies its job. Pilots should identify promising mechanisms and failure modes. They should not be forced to masquerade as full deployment trials just because everyone in AI now wants the conclusion before the method has put its shoes on.
The deployment pattern: triage first, certification later, maybe never
For an infrastructure organisation, the sensible deployment pattern is conservative and still useful.
| Workflow layer | What LLM automation can do | What humans should retain |
|---|---|---|
| Data intake | Check that required NDE maps, legends, axes, and metadata are present | Confirm data provenance and measurement validity |
| First-pass interpretation | Generate modality-specific observations and possible concern zones | Validate physical interpretation of anomalies |
| Cross-model comparison | Surface agreement and disagreement across model outputs | Decide which disagreements require field follow-up |
| Report drafting | Produce structured summaries and maintenance recommendations | Approve final language, severity, and action plan |
| Audit and governance | Store prompts, model versions, outputs, and reviewer edits | Own compliance, liability, and inspection sign-off |
This pattern has a near-term ROI pathway. Agencies and contractors spend time translating technical inspection artefacts into reports that managers can act on. If an LLM system can reduce drafting time, standardise language, highlight potential concern zones, and make NDE outputs more legible to non-specialists, it can be valuable before it ever makes a final decision.
The harder ROI claim—better safety outcomes—requires more evidence. A future study would need more bridges, independent ground truth, blinded expert comparisons, repeatability tests, uncertainty scoring, and prospective measurement of decision speed and maintenance quality. It would also need to test failure cases: noisy maps, conflicting modalities, wrong labels, missing scales, unusual bridge geometries, and data from agencies other than the one source used here.
A serious product would also need integration with existing asset management systems. The LLM should not live as a separate chat window where evidence goes to become vibes. It should sit inside a controlled workflow: ingest inspection files, produce traceable claims, link each claim to source maps, record reviewer edits, and generate final reports only after approval.
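The traceability requirement can be made concrete with a small audit record. This is a minimal sketch under stated assumptions: `AuditRecord`, its fields, and `can_release` are hypothetical, and the SHA-256 fingerprint is just one convenient way to detect that a stored record has changed.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class AuditRecord:
    map_ids: List[str]     # which source maps the claim set draws on
    prompt: str            # exact prompt text sent to the model
    model_version: str     # identifier of the model that produced the output
    output: str            # the generated draft
    reviewer: str = ""     # empty until a named human has reviewed it
    accepted: bool = False
    reviewer_edits: List[str] = field(default_factory=list)

    def fingerprint(self) -> str:
        # Stable hash of the full record, so later tampering is detectable.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def can_release(record: AuditRecord) -> bool:
    # A final report is generated only after a named reviewer accepts it.
    return record.accepted and bool(record.reviewer)
```

The gate in `can_release` is intentionally boring: no reviewer name, no report. The interesting governance questions start after that, but they cannot start without it.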
That is less glamorous than “AI inspects bridges.” It is also much closer to how infrastructure actually gets maintained.
The real lesson is translation, not replacement
The paper is valuable because it reframes multimodal LLMs as translators between measurement-heavy engineering artefacts and decision-heavy maintenance workflows. The inputs are contour maps. The intermediate outputs are model interpretations. The final product is a structured report that a human engineer can review, correct, and use.
That is a realistic role for LLMs in infrastructure: not autonomous authority, but structured cognition support.
Bridge inspection is full of expensive ambiguity. NDE maps contain signals that require expertise. Maintenance decisions require prioritisation under budget, risk, and time constraints. LLMs can help connect those layers by making technical evidence easier to inspect, compare, and communicate. They can also introduce new risks if their fluent summaries hide uncertainty or quietly paper over disagreement between models.
So the operating conclusion is precise. This paper does not show that LLMs are ready to certify bridge safety. It shows that a multi-stage LLM workflow can turn complex NDE contour maps into useful inspection-support narratives in a small pilot. That is not a revolution. It is a workflow wedge.
And in infrastructure, workflow wedges matter. Bridges are not maintained by breakthroughs. They are maintained by repeatable processes, defensible judgement, and paperwork that arrives before the budget meeting. Glamorous? No. Load-bearing? Potentially.
References
[1] Viraj Nishesh Darji, Callie C. Liao, and Duoduo Liao, “Automated Interpretation of Non-Destructive Evaluation Contour Maps Using Large Language Models for Bridge Condition Assessment,” arXiv:2507.14107v1, 2025. https://arxiv.org/abs/2507.14107