Bridges and Biases: How LLMs Are Learning to Inspect Infrastructure

In an age where aging infrastructure meets accelerating AI, a new paper out of George Mason University asks whether large language models can interpret something even seasoned engineers find difficult: non-destructive evaluation (NDE) contour maps of bridges. Based on this pilot study, the answer is a cautious yes, with caveats that echo through the entire field of AI-assisted engineering.

The Problem: Data Is There — Expertise Isn’t Always

Bridges are scanned using advanced non-destructive evaluation (NDE) tools — Ground Penetrating Radar (GPR), Electrical Resistivity (ER), Impact Echo (IE), and Ultrasonic Surface Waves (USW) — but interpreting those outputs requires human expertise, which is not always available, especially during emergency assessments or in rural areas. Contour maps from these tools don’t speak for themselves.

This study suggests a way forward: deploy multimodal LLMs to describe and analyze those maps, giving engineers a first-draft assessment — rich in detail, standardized across projects, and surprisingly aligned with expert judgment.

The Pipeline: LLMs as Analyst Teams

The authors designed a three-stage process:

Image Captioning (9 Models): Each contour map — representing different NDE parameters — is run through a carefully prompted image-captioning LLM (e.g., ChatGPT-4, Claude 3.5 Sonnet, CogVLM2).
Parallel Interpretation: The captions produced by each model are evaluated against four metrics: relevance, usefulness, coverage, and specificity.
Summarization (5 Models): The top captions are then synthesized by a second round of LLMs to produce a single cohesive condition report.
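The three stages above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Caption` record, the `run_pipeline` function, the `min_passed` threshold, and the idea of passing captioners, a judge, and a summarizer as plain callables are all assumptions made for the example. In practice each callable would wrap a call to a multimodal model or a human reviewer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# The four evaluation metrics named in the article.
METRICS = ("relevance", "usefulness", "coverage", "specificity")

# Hypothetical interfaces: each stage is just a callable.
Captioner = Callable[[str], str]            # contour map name -> caption text
Judge = Callable[[str], Dict[str, bool]]    # caption text -> yes/no per metric
Summarizer = Callable[[List[str]], str]     # kept captions -> condition report


@dataclass
class Caption:
    model: str
    map_name: str
    text: str
    verdicts: Dict[str, bool]

    @property
    def passed(self) -> int:
        """Number of metrics the caption satisfied."""
        return sum(self.verdicts.get(m, False) for m in METRICS)


def run_pipeline(
    maps: List[str],
    captioners: Dict[str, Captioner],
    judge: Judge,
    summarizer: Summarizer,
    min_passed: int = len(METRICS),  # assumed cutoff: keep captions passing every metric
) -> Tuple[List[Caption], str]:
    # Stage 1: every captioning model describes every contour map.
    captions = []
    for map_name in maps:
        for model, captioner in captioners.items():
            text = captioner(map_name)
            # Stage 2: score the caption against the four metrics.
            captions.append(Caption(model, map_name, text, judge(text)))

    # Keep only the captions that clear the cutoff.
    kept = [c for c in captions if c.passed >= min_passed]

    # Stage 3: synthesize the kept captions into one report.
    report = summarizer([f"[{c.map_name}] {c.text}" for c in kept])
    return kept, report
```

Keeping the stages as interchangeable callables mirrors the study's setup, where several captioning and summarizing models are compared against the same maps.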

Who Won? And Why It Matters

Out of nine captioning models, four stood out:

LLM Model	Relevance	Usefulness	Coverage	Specificity	Score
Claude 3.5 Sonnet	Yes	Yes	Yes	Yes	5
ChatGPT-4	Yes	Yes	Yes	Yes	5
CogVLM2	Yes	Yes	Yes	Yes	5
ShareGPT4V	Yes	Yes	Yes	No	4

Notably, Claude and ChatGPT-4 were also the best summarizers, producing reports with engineering-grade detail and clarity.

What stands out here is not just accuracy but consistency. Claude and ChatGPT-4 interpreted complex attenuation maps, resistivity heatmaps, and delamination indicators in ways that aligned with human domain experts. Their summaries identified specific areas of concern, such as the 50–80 ft and 160–200 ft segments of the bridge, and proposed specific actions (reinforcement, overlay, corrosion treatment).

Why This Matters for Automation

If infrastructure inspection is going to scale, it must relieve the bottleneck of expert review. This pilot study suggests that LLMs can do more than generate captions; they can scaffold decision workflows:

Highlight zones of concern from complex, heterogeneous data.
Suggest targeted interventions based on data patterns.
Generate consistent documentation for planning and audits.

This could be the beginning of a hybrid inspection stack where LLMs do the first-pass triage, and human engineers validate or override, saving time and improving consistency across states, departments, and teams.
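One way to picture that hybrid stack is as a review queue in which nothing the LLM proposes becomes actionable until an engineer has accepted or overridden it. The sketch below is an assumption about how such a record might look; the `Finding` schema and the `accept`, `override`, and `pending` helpers are hypothetical names and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Finding:
    """One LLM-flagged deck segment awaiting engineer review (hypothetical schema)."""
    start_ft: float
    end_ft: float
    llm_action: str
    decision: str = "pending"            # "pending", "accepted", or "overridden"
    engineer_action: Optional[str] = None

    def final_action(self) -> Optional[str]:
        """Return the action to plan for, or None while the finding is unreviewed."""
        if self.decision == "accepted":
            return self.llm_action
        if self.decision == "overridden":
            return self.engineer_action
        return None


def accept(finding: Finding) -> None:
    """Engineer confirms the LLM's suggested action."""
    finding.decision = "accepted"


def override(finding: Finding, engineer_action: str) -> None:
    """Engineer replaces the LLM's suggestion with their own action."""
    finding.decision = "overridden"
    finding.engineer_action = engineer_action


def pending(findings: List[Finding]) -> List[Finding]:
    """Findings that still need a human decision."""
    return [f for f in findings if f.decision == "pending"]
```

The design choice worth noting is that `final_action` returns nothing for unreviewed findings, so the LLM's first-pass triage can never reach a work plan without a human decision attached.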

Limitations: Prompt Quality Is Half the Battle

The weakest models, BLIP and vit-gpt2, did not fall short because of model size alone; they also lacked role-specific prompting. The best models succeeded with rich, contextual prompts that cast the LLM as a structural engineer. This reinforces a key lesson for all enterprise AI applications: garbage prompts = garbage insight.
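To make "role-specific prompting" concrete, here is a minimal sketch of a prompt builder. The wording of the prompt is invented for illustration and is not the prompt used in the study; the function name `build_caption_prompt` and its parameters are likewise assumptions.

```python
def build_caption_prompt(nde_method: str, quantity: str, units: str) -> str:
    """Assemble a role-specific captioning prompt (illustrative wording only)."""
    lines = [
        # Cast the model in the expert role before describing the task.
        "You are a structural engineer experienced in bridge deck evaluation.",
        # Tell the model what kind of map it is looking at and what it measures.
        f"The attached image is a {nde_method} contour map showing {quantity} in {units}.",
        "Describe the map for a condition report:",
        "1. Locate zones of concern by their position along the deck.",
        "2. State what the values in each zone suggest about deck condition.",
        "3. Recommend a follow-up action for each zone, or say that none is needed.",
        # Discourage confident guesses on unreadable input.
        "If the image is unclear, say so rather than guessing.",
    ]
    return "\n".join(lines)


prompt = build_caption_prompt("Ground Penetrating Radar", "signal attenuation", "dB")
print(prompt)
```

Compared with a bare "describe this image" request, a prompt like this supplies the role, the measurement context, and the expected structure of the answer, which is the kind of context the article credits for the stronger captions.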

From Bridges to Broader Systems

While this work focuses on bridges, the architecture has legs:

Oil pipelines and power grids also rely on technical visualizations.
Manufacturing QA systems could benefit from LLMs interpreting sensor heatmaps.
Environmental monitoring via LIDAR and satellite images might be next.

What’s important is the pattern: LLMs used not just for text but as multimodal systems that reason over physical structures. This flips the script from office assistant to field analyst.

Final Thoughts

This paper quietly proposes a future where AI is not just in the back office — but crawling through the underbelly of bridges, whispering suggestions to engineers on where to patch, probe, or pour. It’s a preview of infrastructure automation that isn’t flashy, but absolutely necessary.

Cognaptus: Automate the Present, Incubate the Future.
