Bridges and Biases: How LLMs Are Learning to Inspect Infrastructure

TL;DR for operators

Bridge teams do not usually lack data. They lack enough expert time to turn dense inspection data into clear, defensible decisions. That is the operational gap this paper tries to narrow: not by replacing bridge engineers with a chatbot in a hard hat, thankfully, but by using multimodal LLMs to translate non-destructive evaluation contour maps into structured condition assessments and maintenance recommendations.¹

The paper proposes a three-stage workflow. First, five NDE contour maps from one bridge are prepared as inputs: two Ground Penetrating Radar views, Electrical Resistivity, Impact-Echo, and Ultrasonic Surface Waves. Second, multiple image-captioning models interpret the maps in parallel. Third, the stronger interpretations are fed into summarization models that consolidate them into a bridge-condition report.

The strongest captioning models in the pilot were Claude 3.5 Sonnet, ChatGPT-4, CogVLM2, and ShareGPT4V. Claude, ChatGPT-4, and CogVLM2 received the top overall image-captioning rating of 5, while ShareGPT4V received 4 because it lacked specificity. For summarization, ChatGPT-4 scored 5.00 and Claude 3.5 Sonnet scored 4.67, ahead of Mistral, Gemini, and Llama3.

For operators, the immediate business value is not autonomous bridge certification. It is faster first-pass triage, more consistent report drafting, and a better interface between technical NDE specialists and budget-holding decision-makers. The dull but essential phrase is “human-in-the-loop.” The study itself practically begs for it.

The main boundary is scale. The experiment uses one FHWA bridge case in Mississippi, five contour maps, qualitative ratings assigned by the researchers and one domain expert, and prompt-optimised model outputs. That is enough to show a plausible workflow. It is not enough to hand an LLM a capital maintenance budget and tell it to enjoy itself.

The machine does not inspect the bridge; it organises the evidence

A bridge inspection file can contain several kinds of truth at once. One map hints at cover depth. Another suggests moisture or deterioration. Another points toward corrosion risk. Another says something about delamination. Another approximates concrete stiffness. Individually, each is a technical signal. Together, they become an argument about where the bridge may be weakening.

That “together” is the expensive part.

The paper’s useful move is not simply asking a model to caption an image. Captioning is the visible surface. The deeper mechanism is workflow orchestration: feed several technical contour maps into multiple multimodal models, compare their descriptive outputs, then use a second LLM layer to consolidate the interpretations into a condition summary.

A simplified version looks like this:

Stage	Input	Model role	Output
Data preparation	Five NDE contour maps	Standardise visual inputs and prompt context	A common inspection task for each model
Parallel captioning	GPR, ER, IE, USW maps	Describe condition signals, likely defects, and poor-condition areas	Multiple model interpretations
Model filtering	Caption outputs	Select stronger interpretations based on relevance, usefulness, coverage, specificity	Shortlist of high-quality captions
Summarisation	Top caption outputs	Consolidate findings into one report	Condition summary and recommendations
Human review	Generated report	Validate against engineering judgement and field evidence	Inspection support, not final authority
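The stages in the table can be sketched as a small orchestration function. This is a minimal sketch, not the paper's implementation: `run_workflow`, `DraftReport`, the `keep` filter, and the two stub models are hypothetical names introduced here, and real captioners would call multimodal model APIs instead of returning canned strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical alias: a captioner takes (map name, prompt) and returns text.
Captioner = Callable[[str, str], str]


@dataclass
class DraftReport:
    captions: Dict[str, Dict[str, str]]  # model name -> map name -> caption
    shortlisted: List[str]               # models that passed the filter
    summary: str                         # consolidated draft text
    approved: bool = False               # only a human reviewer flips this


def run_workflow(
    maps: List[str],
    prompt: str,
    captioners: Dict[str, Captioner],
    keep: Callable[[str, Dict[str, str]], bool],
    summarise: Callable[[List[str]], str],
) -> DraftReport:
    # Parallel captioning: every model describes every map.
    captions = {
        name: {m: caption(m, prompt) for m in maps}
        for name, caption in captioners.items()
    }
    # Model filtering: keep only the stronger interpretations.
    shortlisted = [name for name in captions if keep(name, captions[name])]
    # Summarisation: consolidate the shortlisted captions into one draft.
    top = [captions[name][m] for name in shortlisted for m in maps]
    return DraftReport(captions, shortlisted, summarise(top))


# Stub models standing in for real multimodal systems (assumption).
def verbose_model(map_name: str, prompt: str) -> str:
    return f"{map_name}: low values near 80-100 ft"


def vague_model(map_name: str, prompt: str) -> str:
    return "a colourful chart"


report = run_workflow(
    maps=["GPR cover depth", "Impact-Echo"],
    prompt="Act as a structural engineer.",
    captioners={"verbose": verbose_model, "vague": vague_model},
    # Toy specificity filter: every caption must name a location in feet.
    keep=lambda name, caps: all("ft" in text for text in caps.values()),
    summarise=lambda texts: " | ".join(texts),
)
```

The draft keeps every caption, including the ones that were filtered out, and starts with `approved=False`. That mirrors the last row of the table: the generated report is inspection support until a human says otherwise.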

This is why a mechanism-first reading matters. A leaderboard summary would say “ChatGPT-4 and Claude did well.” True, but incomplete. The more important claim is that LLMs may become a connective layer inside infrastructure inspection workflows, where the bottleneck is not only seeing the signal but turning heterogeneous signals into a usable maintenance narrative.

That distinction matters because infrastructure agencies do not procure “interesting model behaviour.” They procure workflows that shorten cycle time, reduce ambiguity, and survive audit. The bridge does not care whether the summary sounds elegant. The procurement officer, engineer, and liability insurer rather do.

Five contour maps are five different languages of deterioration

The study uses five contour maps from one bridge in the Federal Highway Administration bridge database. The selected bridge is in Mississippi, with Structure Number 11002200250051B and LTBP Bridge Number 28-000008. The authors note that, as of July 23, 2024, 38 bridges in the database included NDE data, but this pilot examines only one bridge.

The five maps are not redundant images of the same thing. They represent different measurement technologies and different physical interpretations:

NDE map	What it measures in the paper	What it can contribute to assessment
GPR cover depth	Cover depth in inches	Whether reinforcing steel has enough concrete cover for protection
GPR DCA	Depth-Corrected Attenuation at top rebar level, in decibels	Possible deterioration, moisture-related effects, or internal condition trends
Electrical Resistivity	Resistivity patterns	Concrete properties and corrosion activity risk
Impact-Echo	Frequency in kHz	Concrete integrity and possible delamination
Ultrasonic Surface Waves	Modulus in ksi	Concrete mechanical properties and stiffness variation

This is where multimodal LLMs become interesting, and also dangerous. They are not merely reading a photograph of a crack. They are interpreting technical visualisations that encode physical measurements. The colours and spatial patterns are not decoration. They are the data.

A conventional image-captioning model can get away with saying “a dog sitting on grass.” An infrastructure model must do something more awkward: infer what a low-frequency patch, a resistivity anomaly, or a lower-modulus region might mean for a specific part of a bridge deck. That is less like describing an image and more like narrating an instrument panel.

The paper’s prompt reflects this. It tells the model to act as a structural engineer, notes that the x- and y-axes are in feet, and asks it to describe bridge condition, determine whether repairs are needed, and identify poor-condition areas. That role assignment is an implementation detail with strategic importance. The model is being steered away from generic captioning and toward professional interpretation.

This also creates the first bias. When a prompt asks a model to determine whether repairs are needed and identify areas in poor condition, the model has a nudge toward finding actionable defects. That may be useful in triage. It is not neutral measurement. A well-run deployment would need prompts that separate observation, inference, confidence, and recommendation instead of letting them arrive as one fluent paragraph wearing a reflective vest.
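One way to keep observation, inference, confidence, and recommendation apart is to make them separate fields that can be checked before a reviewer sees them. The schema below is a hypothetical sketch: the `Finding` class, the confidence levels, the `problems` rules, and the example values are assumptions for illustration, not anything the paper specifies.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

CONFIDENCE_LEVELS = ("low", "medium", "high")  # assumed three-step scale


@dataclass(frozen=True)
class Finding:
    modality: str                    # e.g. "Impact-Echo"
    region_ft: Tuple[float, float]   # start and end along the deck, in feet
    observation: str                 # what is visible on the map
    inference: str                   # what it might mean physically
    confidence: str                  # one of CONFIDENCE_LEVELS
    recommendation: Optional[str] = None  # optional, kept apart from inference


def problems(f: Finding) -> List[str]:
    """Return reasons a finding should be sent back before review."""
    issues = []
    if not f.observation.strip():
        issues.append("missing observation")
    if f.confidence not in CONFIDENCE_LEVELS:
        issues.append("unknown confidence level")
    if f.region_ft[0] >= f.region_ft[1]:
        issues.append("empty or reversed region")
    if f.recommendation and f.confidence == "low":
        issues.append("recommendation rests on low-confidence inference")
    return issues


# Illustrative values only; not taken from the paper's outputs.
example = Finding(
    modality="Impact-Echo",
    region_ft=(80.0, 100.0),
    observation="Lower-frequency patch relative to the surrounding deck",
    inference="Possible delamination",
    confidence="low",
    recommendation="Schedule repair",
)
issues = problems(example)
```

Here the example is flagged because a repair recommendation is attached to a low-confidence inference, which is exactly the jump the repair-seeking prompt encourages.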

The experiment tests a workflow, not an oracle

The evidence in the paper has three main parts.

First, the authors test nine image-captioning models on the NDE maps. Second, they evaluate the outputs using four qualitative criteria: relevance, usefulness, coverage, and specificity. Third, they pass the top captioning outputs into five summarization models and evaluate the resulting summaries using completeness, in-depth coverage, and formatting/presentation.

The captioning comparison is the paper’s main evidence for model selection:

Image-captioning model	Relevance	Usefulness	Coverage	Specificity	Overall rating
Claude 3.5 Sonnet	Yes	Yes	Yes	Yes	5
ChatGPT-4	Yes	Yes	Yes	Yes	5
CogVLM2	Yes	Yes	Yes	Yes	5
ShareGPT4V	Yes	Yes	Yes	No	4
Florence-2	Yes	No	Yes	No	3
Paligemma FT	Yes	No	Yes	No	3
Paligemma	Yes	No	No	No	2
BLIP large	No	No	No	No	1
vit-gpt2	No	No	No	No	1
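A small arithmetic check on the table above: in every row, the overall rating equals one plus the number of criteria marked Yes. That pattern is an observation about the published table, not a scoring rule this article attributes to the authors. The names `CAPTION_TABLE` and `follows_pattern` are introduced here for illustration.

```python
# Rows copied from the table: (relevance, usefulness, coverage, specificity), rating
CAPTION_TABLE = {
    "Claude 3.5 Sonnet": ((True, True, True, True), 5),
    "ChatGPT-4": ((True, True, True, True), 5),
    "CogVLM2": ((True, True, True, True), 5),
    "ShareGPT4V": ((True, True, True, False), 4),
    "Florence-2": ((True, False, True, False), 3),
    "Paligemma FT": ((True, False, True, False), 3),
    "Paligemma": ((True, False, False, False), 2),
    "BLIP large": ((False, False, False, False), 1),
    "vit-gpt2": ((False, False, False, False), 1),
}


def follows_pattern(flags, rating) -> bool:
    """True when rating == 1 + number of criteria met (an observed pattern)."""
    return rating == 1 + sum(flags)


consistent = all(follows_pattern(f, r) for f, r in CAPTION_TABLE.values())
```

If the pattern reflects how the ratings were assigned, each criterion carries equal weight, so a model that is relevant but vague loses as much as one that is specific but incomplete. Whether that weighting suits a given agency is a separate question.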

The obvious reading is that larger, more capable multimodal systems did better. The more useful reading is that the task rewards models that can combine visual perception, domain framing, and structured explanation.

The weaker captioning models did not merely produce less poetic prose. They failed on usefulness, coverage, and specificity. In infrastructure reporting, those are not cosmetic defects. A vague but fluent output is dangerous because it looks like analysis while doing almost none of the work. This is the classic enterprise AI trap: the dashboard looks calm, therefore someone assumes the bridge is too.

The evaluation design also deserves careful handling. The scoring rubric is qualitative, built for this pilot, and assessed by the researchers and one domain expert. That makes it useful for exploratory comparison, but not equivalent to a statistically validated benchmark. There is no large labelled dataset, no prospective field deployment, and no ground-truth confusion matrix showing false positives, false negatives, or localisation error rates.

So the result is not “LLMs can inspect bridges.” The result is narrower and better: some multimodal LLMs can produce useful structured interpretations of NDE contour maps in a controlled pilot, especially when prompted as technical analysts and followed by summarisation.

That is still a meaningful result. It is just not the same result the keynote version would put on a slide with a picture of a robot dog.

Prompting is part of the system, not seasoning sprinkled after the meal

The paper states that prompt optimisation played an important role. The vit-gpt2 model was used without prompts. BLIP large was tested with and without prompts. The other models went through more extensive prompt optimisation, moving from a simple description request to a role-based professional inspection prompt.

This matters because model capability and prompt design are partly entangled. If a model performs well after careful prompting, the result belongs to the system: model plus prompt plus input formatting plus evaluation criteria. It does not belong to raw model architecture alone.

For business users, that is not a problem. Enterprises buy systems, not model purity. But it changes how the result should be interpreted.

A production bridge-inspection assistant would need a prompt library, not a single clever prompt. It would need templates for different bridge types, map scales, NDE modalities, deterioration hypotheses, confidence levels, and reporting standards. It would also need negative prompts: instructions not to infer beyond the map, not to collapse all anomalies into repair recommendations, and not to invent urgency where the evidence only supports further investigation.
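A prompt library of this kind could start as a template plus a fixed list of constraints. The sketch below is hypothetical: the wording of `ROLE` and `GUARDRAILS` is written for this article and is not the paper's prompt. Only the role framing, the axes in feet, and the three tasks come from the study's description.

```python
from string import Template

# Assumed wording; the paper's exact prompt text is not reproduced here.
ROLE = "You are a structural engineer reviewing a bridge-deck NDE contour map."

GUARDRAILS = (
    "Describe what is visible on the map before interpreting it.",
    "Do not infer beyond the map; say so when the evidence is inconclusive.",
    "Do not treat every anomaly as a repair need.",
    "Recommend further investigation, not urgency, when evidence is limited.",
)

PROMPT = Template(
    "$role\n"
    "Modality: $modality ($units). The x- and y-axes are in $axis_units.\n"
    "Tasks: $tasks\n"
    "Constraints:\n$constraints"
)


def build_prompt(modality: str, units: str, tasks, axis_units: str = "feet") -> str:
    """Fill the template for one NDE modality with the shared constraints."""
    constraints = "\n".join(f"- {rule}" for rule in GUARDRAILS)
    return PROMPT.substitute(
        role=ROLE,
        modality=modality,
        units=units,
        axis_units=axis_units,
        tasks="; ".join(tasks),
        constraints=constraints,
    )


prompt = build_prompt(
    modality="Impact-Echo",
    units="frequency in kHz",
    tasks=[
        "describe bridge condition",
        "determine whether repairs are needed",
        "identify poor-condition areas",
    ],
)
```

Keeping the constraints in one place means they are versioned and audited with the template, rather than retyped per inspection.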

The interesting deployment lesson is therefore not “use the best model.” It is “control the interpretation contract.” The prompt is that contract. It tells the model whether it is an observer, analyst, drafter, decision-support assistant, or overconfident intern with a thesaurus.

Summarisation is where the workflow becomes operational

The second model layer is easy to treat as a footnote. It is not. In real inspection workflows, summary generation is often where technical evidence becomes management action.

The paper feeds outputs from the four strongest captioning models—Claude 3.5 Sonnet, ChatGPT-4, CogVLM2, and ShareGPT4V—into five summarization models. The summarizers are then evaluated for completeness, in-depth coverage, and formatting/presentation.

Summarization model	Completeness	In-depth coverage	Formatting and presentation	Overall rating
ChatGPT-4	5	5	5	5.00
Claude 3.5 Sonnet	5	4	5	4.67
Mistral	4	3	4	3.67
Gemini	3	3	4	3.33
Llama3	3	3	4	3.33

ChatGPT-4 and Claude 3.5 Sonnet lead. The business implication is straightforward: the summarisation layer is not merely compressing text. It is doing synthesis, prioritisation, and report formatting. That is exactly the sort of work that consumes senior engineering attention when agencies must translate inspection evidence into maintenance planning.
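The overall ratings in the summarisation table match the unweighted mean of the three criterion scores, rounded to two decimals. The check below uses the table's own numbers. That the authors averaged the scores this way is an inference from the figures, and `SUMMARY_SCORES`, `PUBLISHED`, and `overall` are names introduced here.

```python
from statistics import mean

# (completeness, in-depth coverage, formatting and presentation), from the table
SUMMARY_SCORES = {
    "ChatGPT-4": (5, 5, 5),
    "Claude 3.5 Sonnet": (5, 4, 5),
    "Mistral": (4, 3, 4),
    "Gemini": (3, 3, 4),
    "Llama3": (3, 3, 4),
}

# Overall ratings as published in the table.
PUBLISHED = {
    "ChatGPT-4": 5.00,
    "Claude 3.5 Sonnet": 4.67,
    "Mistral": 3.67,
    "Gemini": 3.33,
    "Llama3": 3.33,
}


def overall(scores) -> float:
    """Unweighted mean of the criterion scores, rounded to two decimals."""
    return round(mean(scores), 2)


recomputed = {model: overall(s) for model, s in SUMMARY_SCORES.items()}
```

The gap between the top two models comes down to a single point on in-depth coverage, which is a thin margin for a rubric scored by a small panel.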

The appendix examples show the potential. A Claude 3.5 Sonnet caption identifies areas such as 0–50 ft, 80–100 ft, and 220–250 ft as concerning, associating them with low cover depth, possible delamination, reduced concrete quality, corrosion risk, or higher attenuation. A ChatGPT-4 summarisation example consolidates findings into a report, highlighting regions such as 50–80 ft, 160–200 ft, and several resistivity-related areas, then recommending detailed inspection, repairs, protective overlays, corrosion mitigation, and regular monitoring.

That is useful. It is also revealing.

The appendix outputs do not perfectly align on every region. One model’s demonstration emphasises 80–100 ft and 220–250 ft; another summary emphasises 50–80 ft and 160–200 ft, with additional resistivity zones. The paper itself notes periodic inconsistencies and the need for expert validation. This is not an embarrassment. It is the point. The system should surface disagreements, not smooth them into a single confident paragraph.

A good deployment would preserve model disagreement as a diagnostic signal. If three models flag one deck segment and one model flags another, the answer is not to average them into false consensus. The answer is to show the discrepancy to the engineer and ask whether the maps, colour scales, input ordering, or modality interpretations explain it. Consensus is useful. Disagreement is often more useful, because bridges rarely fail out of respect for clean formatting.
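Preserving disagreement can be as simple as cutting the deck into segments and listing which source flagged each one. The sketch below uses the region ranges quoted from the appendix examples. The function `agreement_by_segment` and the source labels are hypothetical, and the two sources are not strictly like-for-like, since one is a caption and the other a summary.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # start and end along the deck, in feet

# Ranges as quoted in this article from the paper's appendix examples.
FLAGS: Dict[str, List[Span]] = {
    "claude_caption": [(0, 50), (80, 100), (220, 250)],
    "chatgpt4_summary": [(50, 80), (160, 200)],
}


def agreement_by_segment(flags: Dict[str, List[Span]]):
    """Split the deck at every span edge and list who flagged each segment."""
    cuts = sorted({edge for spans in flags.values() for span in spans for edge in span})
    rows = []
    for start, end in zip(cuts, cuts[1:]):
        sources = sorted(
            name
            for name, spans in flags.items()
            if any(s <= start and end <= e for s, e in spans)
        )
        if sources:  # skip segments nobody flagged
            rows.append((start, end, sources))
    return rows


rows = agreement_by_segment(FLAGS)
single_source = [r for r in rows if len(r[2]) == 1]
```

On these ranges every flagged segment has exactly one source, so nothing is corroborated. A table like this, shown to the engineer before final synthesis, makes that visible instead of letting a summary blend the two lists.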

What the paper directly shows, and what operators may infer

The paper’s direct evidence and the business interpretation should be kept separate. Otherwise this becomes marketing, and marketing has a poor record as a load-bearing material.

Claim	What the paper directly shows	Cognaptus interpretation	Boundary
LLMs can interpret NDE contour maps	Several multimodal models produced high-rated descriptions of five maps from one bridge	Multimodal LLMs can act as first-pass interpreters for technical visualisations	One-bridge pilot; no large benchmark or prospective validation
Parallel captioning improves workflow structure	Multiple models generated interpretations before summarisation	Model ensembles can support cross-checking and disagreement discovery	The study does not quantify ensemble reliability statistically
Summarisation creates actionable reports	ChatGPT-4 and Claude produced the strongest summaries under the pilot rubric	Report drafting may be a near-term productivity use case	Generated recommendations require engineer review
Prompt design matters	Prompt optimisation improved output quality for capable models	Prompt libraries and task contracts will be central to deployment	Prompt performance may not generalise across agencies, map styles, or bridge types
LLMs can support bridge maintenance decisions	Outputs included condition summaries and repair suggestions	Agencies could use LLMs for triage, documentation, and prioritisation support	Not a substitute for inspection standards, legal accountability, or field judgement

The strongest practical use case is not “automated inspection.” It is “inspection intelligence workbench.”

A transportation department could use such a system after NDE data collection to generate an initial narrative: where the maps agree, where they diverge, which deck regions deserve follow-up, what additional tests may be justified, and how to communicate the findings to non-specialist decision-makers. The engineer remains responsible. The model reduces blank-page time, not professional liability.

Contractors could use a similar workflow to prepare preliminary repair scopes, estimate investigation priorities, or standardise internal reporting across projects. Asset owners could use it to make technical evidence more readable for capital planning. Insurers and auditors might eventually care about the audit trail: which maps were reviewed, which prompts were used, which model versions generated the report, what the human reviewer accepted or rejected.

None of this requires pretending the model is an inspector. In fact, pretending that would make the product worse.

The bias problem is subtler than hallucination

The title of this article says “biases,” but the interesting bias here is not only hallucination. It is interpretive framing.

A model prompted as a structural engineer may produce confident engineering-style language. That can be good when the output is accurate, disciplined, and tied to visible evidence. It can be bad when the model converts uncertain map patterns into a polished maintenance narrative without preserving uncertainty.

There are at least four practical biases to manage:

Bias	How it appears in this setting	Operational control
Repair-seeking bias	The prompt asks the model to determine whether repairs are needed, nudging it toward action recommendations	Separate observation, diagnosis, severity, and recommendation fields
Fluency bias	A well-written report may feel more reliable than it is	Require evidence references to map regions and modalities
Consensus bias	Summarisation may smooth over disagreements between captioning models	Preserve disagreement tables before final synthesis
Modality bias	The model may over-weight visually salient colour patterns without understanding measurement physics	Require modality-specific interpretation rules and expert review

This is why “human-in-the-loop” should not mean a human rubber-stamping a beautiful PDF. It means the system should make human review easier by showing evidence, uncertainty, disagreement, and provenance. The reviewer should be able to ask: Which map supports this claim? Which model said it? Did another model disagree? Was the claim observational, inferential, or a recommendation?

Without that structure, the LLM becomes a prose machine sitting between the engineer and the evidence. That is a bad place to put a prose machine. Prose machines are charming. Charm is not calibration.

The appendix examples are demonstrations, not a second thesis

The paper’s appendix serves three likely purposes.

First, the contour-map figures document the input data. This is implementation context, not an independent validation set. The figures show what the models were asked to interpret.

Second, the prompt table is an implementation detail with practical importance. It reveals the role framing used to elicit engineering-style analysis. It also shows why prompt sensitivity must be treated as part of the system design.

Third, the Claude captioning and ChatGPT-4 summarisation examples are qualitative demonstrations. They show the kind of output the workflow can produce: region-specific observations, interpretation of possible defects, and recommendations for inspection, repair, protective overlays, corrosion mitigation, and monitoring.

What they do not prove is equally important. They do not prove that the identified segments are correct against independent ground truth. They do not prove the models would behave consistently across bridge types, map resolutions, missing legends, poor scans, unusual deterioration patterns, or adversarially confusing visualisations. They do not quantify whether a junior engineer plus LLM outperforms a senior engineer without LLM. They do not estimate time saved.

That does not weaken the paper as a pilot. It clarifies its job. Pilots should identify promising mechanisms and failure modes. They should not be forced to masquerade as full deployment trials just because everyone in AI now wants the conclusion before the method has put its shoes on.

The deployment pattern: triage first, certification later, maybe never

For an infrastructure organisation, the sensible deployment pattern is conservative and still useful.

Workflow layer	What LLM automation can do	What humans should retain
Data intake	Check that required NDE maps, legends, axes, and metadata are present	Confirm data provenance and measurement validity
First-pass interpretation	Generate modality-specific observations and possible concern zones	Validate physical interpretation of anomalies
Cross-model comparison	Surface agreement and disagreement across model outputs	Decide which disagreements require field follow-up
Report drafting	Produce structured summaries and maintenance recommendations	Approve final language, severity, and action plan
Audit and governance	Store prompts, model versions, outputs, and reviewer edits	Own compliance, liability, and inspection sign-off
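The data-intake row is the easiest layer to automate without any model at all. A minimal sketch, assuming each inspection package arrives as a dictionary of per-map metadata: the field names in `REQUIRED_FIELDS` and the function `intake_gaps` are hypothetical, and the required map list simply reuses the five maps from this pilot.

```python
from typing import Dict, List

# The five maps used in the pilot; other agencies may require a different set.
REQUIRED_MAPS = (
    "GPR cover depth",
    "GPR DCA",
    "Electrical Resistivity",
    "Impact-Echo",
    "Ultrasonic Surface Waves",
)

# Assumed metadata keys for each map.
REQUIRED_FIELDS = ("legend", "x_axis_units", "y_axis_units")


def intake_gaps(package: Dict[str, dict]) -> List[str]:
    """List missing maps and missing metadata before any model is called."""
    gaps = []
    for name in REQUIRED_MAPS:
        meta = package.get(name)
        if meta is None:
            gaps.append(f"{name}: map missing")
            continue
        for field in REQUIRED_FIELDS:
            if not meta.get(field):
                gaps.append(f"{name}: {field} missing")
    return gaps


# Illustrative package with one incomplete map and three absent maps.
package = {
    "GPR cover depth": {"legend": "inches", "x_axis_units": "ft", "y_axis_units": "ft"},
    "Impact-Echo": {"legend": "kHz", "x_axis_units": "ft", "y_axis_units": ""},
}
gaps = intake_gaps(package)
```

A check like this only confirms that the inputs are present. Confirming that the measurements are valid stays in the right-hand column of the table.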

This pattern has a near-term ROI pathway. Agencies and contractors spend time translating technical inspection artefacts into reports that managers can act on. If an LLM system can reduce drafting time, standardise language, highlight potential concern zones, and make NDE outputs more legible to non-specialists, it can be valuable before it ever makes a final decision.

The harder ROI claim—better safety outcomes—requires more evidence. A future study would need more bridges, independent ground truth, blinded expert comparisons, repeatability tests, uncertainty scoring, and prospective measurement of decision speed and maintenance quality. It would also need to test failure cases: noisy maps, conflicting modalities, wrong labels, missing scales, unusual bridge geometries, and data from agencies other than the one source used here.

A serious product would also need integration with existing asset management systems. The LLM should not live as a separate chat window where evidence goes to become vibes. It should sit inside a controlled workflow: ingest inspection files, produce traceable claims, link each claim to source maps, record reviewer edits, and generate final reports only after approval.
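The "final reports only after approval" rule can be enforced in code rather than left to process discipline. The following is a hypothetical sketch: `Claim`, `ReportNotReady`, `final_report`, and the decision labels are names introduced here, and a real system would persist these records in the asset management system rather than in memory.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    text: str
    source_map: str                      # which NDE map supports the claim
    model: str                           # which model produced it
    reviewer_decision: str = "pending"   # "accepted", "rejected", or "pending"


class ReportNotReady(Exception):
    """Raised when a report is requested before review is complete."""


def final_report(claims: List[Claim]) -> str:
    pending = [c for c in claims if c.reviewer_decision == "pending"]
    untraceable = [c for c in claims if not c.source_map]
    if pending or untraceable:
        raise ReportNotReady(
            f"{len(pending)} claim(s) unreviewed, "
            f"{len(untraceable)} claim(s) without a source map"
        )
    # Only reviewer-accepted claims reach the final text, each with provenance.
    return "\n".join(
        f"- {c.text} [{c.source_map}; {c.model}]"
        for c in claims
        if c.reviewer_decision == "accepted"
    )


# Illustrative claims; the text and model labels are made up for the example.
claims = [
    Claim("Low cover depth near 0-50 ft", "GPR cover depth", "model-a", "accepted"),
    Claim("Possible delamination near 80-100 ft", "Impact-Echo", "model-b", "rejected"),
]
text = final_report(claims)
```

Rejected claims stay in the record but never reach the report, which is what gives an auditor something to inspect later.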

That is less glamorous than “AI inspects bridges.” It is also much closer to how infrastructure actually gets maintained.

The real lesson is translation, not replacement

The paper is valuable because it reframes multimodal LLMs as translators between measurement-heavy engineering artefacts and decision-heavy maintenance workflows. The inputs are contour maps. The intermediate outputs are model interpretations. The final product is a structured report that a human engineer can review, correct, and use.

That is a realistic role for LLMs in infrastructure: not autonomous authority, but structured cognition support.

Bridge inspection is full of expensive ambiguity. NDE maps contain signals that require expertise. Maintenance decisions require prioritisation under budget, risk, and time constraints. LLMs can help connect those layers by making technical evidence easier to inspect, compare, and communicate. They can also introduce new risks if their fluent summaries hide uncertainty or quietly smooth over disagreement between models.

So the operating conclusion is precise. This paper does not show that LLMs are ready to certify bridge safety. It shows that a multi-stage LLM workflow can turn complex NDE contour maps into useful inspection-support narratives in a small pilot. That is not a revolution. It is a workflow wedge.

And in infrastructure, workflow wedges matter. Bridges are not maintained by breakthroughs. They are maintained by repeatable processes, defensible judgement, and paperwork that arrives before the budget meeting. Glamorous? No. Load-bearing? Potentially.

References

¹ Viraj Nishesh Darji, Callie C. Liao, and Duoduo Liao, “Automated Interpretation of Non-Destructive Evaluation Contour Maps Using Large Language Models for Bridge Condition Assessment,” arXiv:2507.14107v1, 2025. https://arxiv.org/abs/2507.14107