When AI Reviews AI: Turning Foundation Models into Safety Inspectors

Inspection is not glamorous. It is not the robot demo, not the dashboard, not the moment a prototype obediently follows a traffic cone across a test track. Inspection is the slow, expensive discipline of asking whether the thing that worked once will behave acceptably when the weather changes, the path bends, the sensor gets confused, or the requirement was written by a tired engineer using the phrase “successfully complete” as if English were a formal language.

That is the problem behind “Fighting AI with AI: Leveraging Foundation Models for Assuring AI-Enabled Safety-Critical Systems,” an arXiv research idea paper by Anastasia Mavridou, Divya Gopinath, and Corina S. Păsăreanu.¹ The paper does not present a finished certification method. It proposes a toolchain architecture for using foundation models as assurance assistants in safety-critical AI systems, especially systems that combine conventional software with deep neural network perception.

The tempting headline is: AI will audit AI. Very tidy. Very fundable. Also slightly dangerous.

The more accurate reading is narrower and more useful: foundation models can help translate, test, monitor, and debug the messy semantic layer that sits between human requirements and machine behavior. They are not replacing certification authorities, formal methods, system engineers, or safety cases. They are being inserted into a pipeline where natural language requirements, formal specifications, test generation, vision-language monitoring, semantic coverage, and debugging can finally talk to each other without everyone pretending that “the rover shall complete the segment” is already precise enough.

The paper’s contribution is best understood mechanically. It is not a benchmark story. There are no grand tables proving that Model X beats Model Y by 14.7 points on Safety Assurance Leaderboard 9000. Instead, the value is in the chain: from English requirement, to restricted requirement, to formal logic, to test cases, to semantic monitoring of image sequences, to debugging perception models through human-understandable concepts.

That chain matters because safety-critical AI does not fail only at the model layer. It can fail at the sentence layer.

The assurance gap begins before the neural network sees anything

The paper starts from two problems that safety engineers already know, but that become nastier when AI enters the system.

The first is the old requirements problem. Requirements written in natural language are often ambiguous, incomplete, or inconsistent. In ordinary software projects, this is already painful. In safety-critical systems, it can be catastrophic because the requirement is supposed to become the source of truth for implementation, testing, and verification. If the source of truth is foggy, everyone downstream becomes very productive at building the wrong thing.

The second is the newer AI assurance problem. A deep neural network perception model does not operate on the same conceptual objects that humans use when writing requirements. Humans write “detect the traffic cone,” “stay on the path,” or “avoid pedestrians.” A perception model sees pixels, embeddings, scores, and internal representations that are not naturally traceable to those phrases. This is the semantic gap: requirements live in human language; DNNs live in low-level representations.

In a traditional system, a requirement can often be linked to code paths, interfaces, state transitions, and tests. In an AI-enabled system, especially one using neural perception, the route is less clean. You may know what you wanted the model to detect, but not whether its internal logic corresponds to the intended concept, whether the test set covers the right semantic conditions, or whether the runtime input sequence violates a temporal safety requirement.

The paper’s proposed answer is an integrated framework with two components:

Component	Foundation model role	Assurance stage	Core function
REACT	LLM-assisted requirements processing	Requirements engineering, formalization, analysis, test generation	Turn ambiguous English into structured and formal artifacts, with human validation
SemaLens	VLM-assisted semantic analysis	Testing, monitoring, explanation, debugging	Use human-understandable visual concepts to inspect perception models and image/video sequences

The pairing is important. REACT handles the language-to-logic side. SemaLens handles the pixels-to-concepts side. One gives requirements more precision. The other gives perception behavior more semantic visibility. Together, they try to connect what a system is supposed to do with what its perception component appears to be doing.

Or, less politely: REACT cleans up the promise; SemaLens checks whether the camera-world behavior still resembles the promise.

REACT turns English into something engineers can argue with precisely

The first half of the pipeline is REACT, short for Requirements Engineering with AI for Consistency and Testing. Its job is not to let an LLM invent requirements. That would be an exciting way to create liability.

Instead, REACT uses LLMs as assistants for moving from unrestricted natural language into more disciplined representations. The paper describes several modules: Author, Validate, Formalize, Analyze, and Generate Test Cases.

The mechanism begins with a plain-English requirement. In the paper’s example, the requirement concerns a NASA experimental rover navigating a designated path:

Once the rover is navigating a designated path, it shall continue to move and successfully complete the segment by reaching the traffic cone, i.e., the rover must demonstrate non-blocking behavior toward this goal.

At first glance, this sounds perfectly reasonable. At second glance, it is a small legal dispute disguised as a sentence.

What does “once” mean? Does the obligation begin after one frame of path detection, or only while the rover remains on the path? Does “successfully complete” mean eventually encountering the cone, reaching the cone, or proving non-blocking progress? What happens if the rover temporarily leaves the path? What exactly counts as the target being encountered?

REACT Author uses an LLM to generate candidate translations into Restricted English, a structured natural language with constrained grammar. The crucial detail is that the system may generate multiple candidates rather than pretending there is one obvious interpretation. That is not a weakness. It is the point.

A normal requirements workflow often hides ambiguity because everyone wants a clean document. REACT makes ambiguity visible. It asks, in effect: here are the possible meanings of your sentence; which one did you actually intend?

REACT Validate then helps the user choose among candidate interpretations. The paper emphasizes that validation is human-in-the-loop. The system presents semantic differences through engineer-friendly artifacts such as execution traces or concrete scenarios, so users can accept or reject candidate meanings without directly wrestling with formal logic.

That design choice matters. A fully automated LLM-to-formal-specification system would be seductive but brittle. Domain intent is not always recoverable from language alone. The engineer still has to decide. The LLM can surface alternatives; formal validation can show consequences; the human must select the intended semantics. Annoying, yes. Also known as engineering.
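The validation step can be made concrete with a toy comparison. The sketch below is an illustration, not the paper's tooling: it encodes two hypothetical readings of the rover requirement as Python predicates over finite traces and replays three hand-written scenarios. The predicate names `on_path` and `cone_encounter` are borrowed from the rover example; the two readings, the helper names, and the scenarios are assumptions chosen for illustration.

```python
# Toy illustration: two hypothetical readings of the rover requirement,
# evaluated over finite traces. Each step records which predicates hold.
Step = dict[str, bool]
Trace = list[Step]


def step(on_path: bool, cone_encounter: bool) -> Step:
    return {"on_path": on_path, "cone_encounter": cone_encounter}


def reading_eventually(trace: Trace) -> bool:
    """Whenever the rover is on the path, the cone is encountered then or later."""
    return all(
        any(later["cone_encounter"] for later in trace[i:])
        for i, s in enumerate(trace)
        if s["on_path"]
    )


def reading_stay_on_path(trace: Trace) -> bool:
    """Whenever the rover is on the path, it stays on it until the cone is encountered."""
    def holds_from(i: int) -> bool:
        for s in trace[i:]:
            if s["cone_encounter"]:
                return True
            if not s["on_path"]:
                return False
        return False  # the trace ended before the cone was encountered

    return all(holds_from(i) for i, s in enumerate(trace) if s["on_path"])


# Hand-written scenarios an engineer might be asked to accept or reject.
SCENARIOS = {
    "direct run": [step(True, False), step(True, False), step(True, True)],
    "brief detour": [step(True, False), step(False, False), step(True, True)],
    "stalls on path": [step(True, False), step(True, False), step(True, False)],
}

VERDICTS = {
    name: (reading_eventually(t), reading_stay_on_path(t))
    for name, t in SCENARIOS.items()
}

for name, (eventually, stay) in VERDICTS.items():
    print(f"{name}: eventually={eventually}, stay_on_path={stay}")
```

Only the "brief detour" scenario separates the two readings: the first accepts it, the second rejects it. That single scenario is the question the engineer actually has to answer, and it can be answered without reading a line of temporal logic.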

Once a requirement candidate is validated, REACT Formalize translates it into formal specifications using tools such as FRET and logics such as Linear Temporal Logic over finite traces, or LTLf. REACT Analyze can then check inconsistencies and conflicts across the requirement set before implementation. REACT Generate Test Cases can produce requirement-based candidate tests with coverage guarantees, linking formal requirements to test artifacts.
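To give a feel for what a consistency check over a requirement set does, here is a minimal sketch under strong assumptions. It is not how REACT Analyze or FRET works internally; it simply enumerates every finite trace up to a small bound and looks for one that satisfies all requirements at once. The `progress` property is one possible reading of the rover requirement, while `starts_on_path` and `never_at_cone` are hypothetical requirements invented here, the latter deliberately conflicting.

```python
from itertools import product

# All four valuations of the two predicates at a single step.
VALUATIONS = [
    {"on_path": p, "cone_encounter": c}
    for p, c in product([False, True], repeat=2)
]


def bounded_traces(max_len):
    """Yield every finite trace of length 1..max_len."""
    for n in range(1, max_len + 1):
        for combo in product(VALUATIONS, repeat=n):
            yield list(combo)


def progress(trace):
    """Assumed reading: G(on_path -> F cone_encounter) over a finite trace."""
    return all(
        any(later["cone_encounter"] for later in trace[i:])
        for i, s in enumerate(trace)
        if s["on_path"]
    )


def starts_on_path(trace):
    """Hypothetical requirement: the rover begins the segment on the path."""
    return trace[0]["on_path"]


def never_at_cone(trace):
    """Hypothetical, deliberately conflicting requirement: G(not cone_encounter)."""
    return not any(s["cone_encounter"] for s in trace)


def find_witness(requirements, max_len=4):
    """Return a trace satisfying every requirement, or None within the bound."""
    for trace in bounded_traces(max_len):
        if all(req(trace) for req in requirements):
            return trace
    return None


consistent = find_witness([progress, starts_on_path])
conflicting = find_witness([progress, starts_on_path, never_at_cone])
print("witness:", consistent)
print("conflict found:", conflicting is None)
```

A returned trace is a concrete witness that the requirements can hold together. A `None` result only says that no witness exists within the bound, which is why real toolchains rely on dedicated formal analysis rather than enumeration.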

The paper positions this partly in relation to DO-178C-style requirements-based testing. It also notes an important boundary: DO-178C was not designed for learning-enabled components like DNNs, but safety-critical aerospace systems often combine AI and non-AI software, so conventional certification baselines still matter for the non-AI parts. Emerging guidance for AI in aviation, such as SAE G-34, is treated as complementary rather than magically solved.

This is the first business-relevant move: REACT does not make compliance automatic. It makes the requirement trail more inspectable.

The rover example is a workflow demonstration, not a performance result

The paper includes two figures. Neither should be read as an experiment.

Artifact	Likely purpose	What it supports	What it does not prove
Figure 1: Integrated framework with REACT and SemaLens	Architecture summary	The paper’s proposed assurance pipeline and how modules connect	That the full pipeline is implemented, validated, or accepted by regulators
Figure 2: Natural-language requirement to monitoring workflow	Concrete mechanism illustration	How an English requirement can move through REACT into formal logic and then SemaLens monitoring	That the approach generalizes reliably across all rover, vehicle, or aerospace scenarios
CLIP threshold example at 0.4	Implementation detail inside the monitor demonstration	How visual predicates may be evaluated from image/text similarity	That 0.4 is a universal or validated safety threshold
References to prior SemaLens-related work	Supportive prior-work connection	That components build on earlier concept-based VLM analysis and requirements-based DNN testing work	That this paper independently reports a new large-scale empirical evaluation

This distinction is important because readers coming from machine learning may instinctively search for quantitative evidence. They will not find the usual table of results. This is a research idea paper. Its evidence is architectural coherence and a worked example, not a benchmark win.

That does not make it useless. It just changes how it should be evaluated.

The question is not “how much does it improve accuracy?” The question is “does this pipeline identify the right places where assurance work currently breaks?” On that criterion, the paper is much more interesting.

The pipeline focuses on the seams: English to formal meaning, formal meaning to tests, tests to semantic scenarios, visual input to concepts, concepts to runtime monitoring, and model behavior to debugging explanations. These are exactly the places where safety-critical AI projects tend to accumulate undocumented assumptions.

SemaLens gives perception systems a semantic inspection layer

After REACT formalizes the promise, SemaLens examines perception behavior through a vision-language model lens.

The paper describes four SemaLens modules: Monitor, Image Generate, Test, and Analyze/Explain/Debug. These modules share a basic premise: VLMs such as CLIP can connect images with human-understandable textual concepts. That connection is imperfect, but useful enough to act as an inspection layer.

SemaLens Monitor handles spatial and temporal reasoning over image sequences and videos. In the rover example, REACT produces a temporal property. The property is converted into a deterministic finite automaton. To evaluate the automaton on a sequence of images, predicates such as on_path and cone_encounter must be evaluated for each image.

SemaLens does this by feeding images through CLIP and comparing image embeddings with text embeddings corresponding to the predicates. In the paper’s example, a predicate is treated as true when similarity exceeds a threshold of 0.4. The monitor then evaluates whether the image sequence satisfies the temporal property. In the shown sequence, the monitor returns true from the third image onward.

This is a compact example, but it captures the mechanism nicely:

A vague English requirement becomes a temporal property.
The temporal property becomes an automaton.
Visual predicates are evaluated using VLM image-text similarity.
The automaton checks whether the image sequence satisfies the requirement.

That is not “the AI knows the rover is safe.” It is “a requirement-derived monitor can inspect whether a visual sequence appears to satisfy specific semantic predicates over time.” Less cinematic, more useful.
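The four steps above can be sketched in a few lines. This is a minimal illustration, assuming the temporal property is "whenever `on_path` holds, `cone_encounter` holds then or later"; the paper's exact formula may differ. The similarity scores are invented and stand in for what a VLM such as CLIP would return, chosen so that the verdict turns true at the third frame. Only the 0.4 threshold comes from the paper's example; `next_state` and `monitor` are hypothetical names.

```python
THRESHOLD = 0.4  # the threshold used in the paper's example

# Hypothetical image-text similarity scores per frame (invented for illustration;
# in practice they would come from a VLM such as CLIP).
FRAMES = [
    {"on_path": 0.46, "cone_encounter": 0.21},
    {"on_path": 0.48, "cone_encounter": 0.33},
    {"on_path": 0.44, "cone_encounter": 0.47},
    {"on_path": 0.31, "cone_encounter": 0.52},
]


def next_state(state, on_path, cone_encounter):
    """Two-state automaton for the assumed property G(on_path -> F cone_encounter).

    "ok" is accepting; "pending" means an obligation to reach the cone is open.
    """
    if cone_encounter:
        return "ok"
    if on_path:
        return "pending"
    return state


def monitor(frames, threshold=THRESHOLD):
    """Return, for each frame, whether the sequence so far satisfies the property."""
    state, verdicts = "ok", []
    for scores in frames:
        predicates = {name: score > threshold for name, score in scores.items()}
        state = next_state(state, predicates["on_path"], predicates["cone_encounter"])
        verdicts.append(state == "ok")
    return verdicts


VERDICTS = monitor(FRAMES)
print(VERDICTS)
```

Everything the monitor concludes rests on two thresholded numbers per frame. The automaton is exact; the predicates feeding it are not.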

SemaLens Image Generate extends the test side. The module would use text-conditional diffusion models to generate semantically diverse test images or videos that conform to natural-language requirements and input preconditions. The authors frame this as building on prior requirements-based test generation for neural networks, with additional semantic perturbations conditioned on prompts.

Here the business interpretation must be disciplined. Synthetic images can help explore rare or data-sparse conditions, but they do not automatically equal operational realism. A diffusion-generated foggy-road scenario may be useful for stress testing. It is not, by itself, proof that the deployed vehicle will handle real fog, real sensor artifacts, real lens flare, or real maintenance drift. Synthetic testing is a coverage expansion tool, not a substitute for field validation.

SemaLens Test proposes semantic coverage metrics over image sets. Instead of asking only whether a test dataset has enough images, it asks whether it covers relevant high-level features in the operational design domain: weather, lighting, obstacles, objects, path states, and other concepts. An image covers a feature if its VLM similarity score to that textual feature exceeds a user-specified threshold. Statistical measures can then summarize feature coverage.

This can work in both black-box and white-box modes. In black-box mode, SemaLens analyzes unlabeled image datasets for semantic feature coverage and gaps. In white-box mode, it maps the embedding space of a perception component to the embedding space of a VLM and computes coverage through that lens.
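A black-box version of the coverage idea fits in a short sketch. The scores below are invented; in practice they would be VLM similarity scores between each image and each textual feature. The function names, the feature list, and the `min_fraction` cut-off for flagging gaps are assumptions for illustration, not metrics defined in the paper.

```python
# Hypothetical VLM similarity scores: one row per image, one entry per feature.
SCORES = {
    "img_001": {"fog": 0.55, "night": 0.10, "pedestrian": 0.42},
    "img_002": {"fog": 0.12, "night": 0.08, "pedestrian": 0.47},
    "img_003": {"fog": 0.61, "night": 0.15, "pedestrian": 0.20},
    "img_004": {"fog": 0.09, "night": 0.11, "pedestrian": 0.51},
}


def feature_coverage(scores, threshold):
    """Fraction of images whose similarity to each feature exceeds the threshold."""
    features = sorted({f for row in scores.values() for f in row})
    return {
        f: sum(row.get(f, 0.0) > threshold for row in scores.values()) / len(scores)
        for f in features
    }


def coverage_gaps(coverage, min_fraction):
    """Features covered by fewer than min_fraction of the images."""
    return [f for f, fraction in coverage.items() if fraction < min_fraction]


COVERAGE = feature_coverage(SCORES, threshold=0.4)
GAPS = coverage_gaps(COVERAGE, min_fraction=0.25)
print(COVERAGE)
print(GAPS)
```

With these invented numbers, no image clears the threshold for "night", so the dataset would be flagged as having a night-time gap regardless of how many images it contains.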

Finally, SemaLens AED (Analyze, Explain, and Debug) uses a VLM as a proxy to reason about a separate vision model. The paper’s example is straightforward: if a model classifies an image as a truck, engineers can ask whether relevant concepts such as “metallic” or “rectangular” are also being detected. If the model misclassifies an image, semantic mapping may help localize whether the problem lies in the vision encoder extracting the wrong concepts or in the classifier head using concepts badly.

This is where the “safety inspector” metaphor becomes useful. SemaLens is not a judge. It is more like an inspector with a checklist written in human concepts: path, cone, pedestrian, fog, obstacle, metallic, rectangular, unusual sequence, semantic deviation. The checklist may be incomplete. Some items may be difficult to detect reliably. But without such a checklist, the system remains trapped between raw pixels and vague assurance claims.

The real integration is vocabulary discipline

The strongest idea in the paper is not simply “use LLMs” or “use VLMs.” That is now the default seasoning on every AI systems paper. The stronger idea is that the two model families can be tied together by a shared semantic vocabulary.

REACT extracts and formalizes requirements. Those requirements contain concepts that matter to system behavior. SemaLens can use those same high-level concepts as predicates for monitoring, coverage metrics, test generation, and debugging.

That turns requirements from static documentation into operational inspection hooks.

Consider a simplified version of the flow:

Stage	Artifact produced	Operational use
Plain English requirement	Ambiguous user intent	Starting point for stakeholder review
Restricted English candidate	Explicit candidate meaning	Human validation of intended semantics
Formal specification	Temporal/logical property	Consistency analysis and test generation
Requirement-derived predicates	Concepts such as `on_path`, `cone_encounter`	Semantic monitoring of visual sequences
Test sequences and generated images/videos	Requirement-aligned scenarios	Perception stress testing
Semantic coverage and heatmaps	Concept-level behavior profile	Dataset gap analysis, debugging, runtime anomaly flags

The key is traceability. A concept that appears in a requirement can reappear in tests, monitors, coverage reports, and debugging tools. This is what businesses in regulated AI should care about. Not because traceability sounds impressive in a slide deck, although it absolutely does. Because traceability determines whether teams can explain why a test exists, what requirement it covers, what model behavior it probes, and what kind of failure it is supposed to reveal.

Without that chain, AI assurance tends to degrade into disconnected artifacts: a requirements document, a model card, a test dataset, some simulation logs, a monitoring dashboard, and a prayer.

The paper’s framework suggests how those artifacts could become part of one assurance story.
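One way to picture that assurance story is as a concept-keyed index over artifacts. The sketch below is a hypothetical data structure, not something the paper specifies: the `Artifact` class, the identifier scheme, and the rule that every requirement concept needs both a test and a monitor are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    artifact_id: str     # hypothetical identifier scheme
    kind: str            # "requirement", "test", "monitor", ...
    concepts: frozenset  # concept names the artifact refers to


ARTIFACTS = [
    Artifact("REQ-001", "requirement", frozenset({"on_path", "cone_encounter"})),
    Artifact("REQ-002", "requirement", frozenset({"pedestrian"})),
    Artifact("TEST-014", "test", frozenset({"on_path", "cone_encounter"})),
    Artifact("MON-003", "monitor", frozenset({"on_path", "cone_encounter"})),
]


def index_by_concept(artifacts):
    """Map each concept to the artifact ids that mention it, grouped by kind."""
    index = defaultdict(lambda: defaultdict(list))
    for artifact in artifacts:
        for concept in artifact.concepts:
            index[concept][artifact.kind].append(artifact.artifact_id)
    return index


def untraced_concepts(artifacts, required_kinds=("test", "monitor")):
    """Requirement concepts that lack at least one of the required artifact kinds."""
    index = index_by_concept(artifacts)
    return {
        concept: [kind for kind in required_kinds if kind not in kinds]
        for concept, kinds in index.items()
        if "requirement" in kinds
        and any(kind not in kinds for kind in required_kinds)
    }


GAPS = untraced_concepts(ARTIFACTS)
print(GAPS)
```

Here the hypothetical requirement about pedestrians has no test and no monitor attached, which is exactly the kind of gap that stays invisible when each artifact lives in its own tool.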

What the paper directly shows, and what business readers should infer carefully

Because this is a proposal paper, the evidence should be interpreted cautiously but not dismissed. The authors directly show a conceptual architecture and a worked path-completion example. They do not show a validated industrial deployment, a regulator-approved process, or a comprehensive empirical evaluation.

The practical business reading should therefore separate direct contribution from plausible operational value.

Paper element	What the paper directly shows	Cognaptus business interpretation	Boundary
REACT Author and Validate	LLMs can generate multiple structured interpretations and support human selection	Reduces hidden ambiguity before requirements become expensive downstream errors	Human validation remains essential; LLM interpretation is not authoritative
REACT Formalize and Analyze	Validated structured requirements can be translated into formal logic and analyzed	Creates earlier consistency checks and stronger audit artifacts	Requires formal-methods integration and domain-specific requirement patterns
REACT Generate Test Cases	Formal requirements can drive candidate test generation with traceability	Improves linkage between requirement, test, and evidence	Coverage guarantees depend on the formalization and test-generation method
SemaLens Monitor	VLM-derived predicates can feed temporal monitors over image sequences	Makes runtime and offline video analysis more requirement-aware	VLM spatial reasoning and threshold calibration remain hard
SemaLens Image Generate	Diffusion models can be proposed for semantically diverse test inputs	Helps explore rare, costly, or data-sparse scenarios	Synthetic realism and operational representativeness must be validated
SemaLens Test	VLMs can define semantic feature coverage over image sets	Moves dataset evaluation closer to operational design domain concepts	Coverage depends on concept vocabulary, thresholds, and VLM reliability
SemaLens AED	VLM alignment can support concept-based explanation and debugging	Makes model failures easier to triage and communicate	Explanations are proxies, not guaranteed causal accounts

For companies working on autonomous systems, robotics, aerospace AI, industrial inspection, or safety-critical perception, the near-term value is not “automatic certification.” It is cheaper diagnosis and better traceability.

That matters because assurance cost is not only the cost of tests. It is the cost of discovering too late that tests were aimed at the wrong interpretation of the requirement, or that a dataset was large but semantically thin, or that a model passed ordinary accuracy checks while failing the concept that the safety case actually depends on.

A framework like REACT plus SemaLens is valuable if it can shift discovery earlier: ambiguous requirement before design freeze, semantic gap before deployment, brittle concept before field incident, monitoring violation before accident investigation.

The business value is earlier disagreement, not smoother agreement

The phrase “AI-assisted requirements engineering” can sound like a productivity story: write requirements faster, generate tests faster, reduce manual work. The paper does make those claims. But the more serious value is not speed. It is structured disagreement.

REACT’s multiple candidate translations are valuable because they reveal that stakeholders may not share the same interpretation. In a normal meeting, this disagreement may stay hidden because the sentence sounds reasonable. In a formalized workflow, the sentence branches into alternatives. Someone has to choose.

That is useful friction.

Similarly, SemaLens semantic coverage may reveal that a dataset is not actually covering the operational design domain that the team thought it was covering. A test set may contain many road images but few relevant combinations of lighting, obstacles, path states, and object encounters. Again, the system creates disagreement: between the dataset’s apparent size and its semantic usefulness.

Runtime monitoring creates another form of disagreement: between what the model outputs and what requirement-derived predicates suggest should be true over time. Debugging heatmaps create disagreement between a model’s predicted label and the concepts it appears to rely on.

In safety-critical work, these disagreements are not bugs in the process. They are the process.

The business problem is that organizations often discover disagreements only after integration, late-stage testing, or deployment. The paper’s framework tries to move them into the development pipeline, where they are still embarrassing rather than catastrophic. A charming distinction.

What this does not solve yet

The paper is careful enough to leave several hard problems visible. Business readers should keep them visible too.

First, VLMs still struggle with complex spatial relationships. The authors explicitly acknowledge this in the SemaLens Monitor discussion. That matters because safety-critical perception often depends not only on object presence but on spatial relation, motion, timing, occlusion, and context. “Cone visible” is easier than “the rover is safely progressing toward the cone while remaining on the intended path under partial occlusion.”

Second, threshold selection is unresolved. In the example, a CLIP similarity threshold of 0.4 is used to evaluate predicates. For a demonstration, that is fine. For a safety case, threshold calibration becomes a major issue. A threshold determines when a concept is treated as present or absent. In a temporal monitor, that decision can flip the satisfaction of a requirement. Thresholds therefore need validation, sensitivity analysis, and domain-specific justification.
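A small sweep shows how fragile a single threshold can be. This is a sketch under assumptions: the similarity scores are invented, the property is the assumed reading "whenever `on_path` holds, `cone_encounter` holds then or later", and the threshold grid is arbitrary. It is not a calibration procedure from the paper.

```python
# Hypothetical similarity scores per frame (invented for illustration).
FRAMES = [
    {"on_path": 0.47, "cone_encounter": 0.21},
    {"on_path": 0.48, "cone_encounter": 0.33},
    {"on_path": 0.44, "cone_encounter": 0.43},
    {"on_path": 0.45, "cone_encounter": 0.41},
]


def satisfied(frames, threshold):
    """Final verdict for the assumed property G(on_path -> F cone_encounter)."""
    pending = False  # True while an obligation to reach the cone is open
    for scores in frames:
        if scores["cone_encounter"] > threshold:
            pending = False
        elif scores["on_path"] > threshold:
            pending = True
    return not pending


def sweep(frames, thresholds):
    """Evaluate the same frame sequence under each candidate threshold."""
    return {t: satisfied(frames, t) for t in thresholds}


RESULTS = sweep(FRAMES, [0.35, 0.40, 0.42, 0.46, 0.50])
for threshold, verdict in RESULTS.items():
    print(f"threshold={threshold:.2f} -> satisfied={verdict}")
```

On these invented scores the verdict holds at 0.35 and 0.40, fails at 0.42 and 0.46, and holds again at 0.50, but only vacuously, because at that level neither predicate ever fires. The same footage supports three different stories depending on one number.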

Third, diffusion-generated tests require realism controls. Synthetic inputs are attractive because rare scenarios are expensive to collect. But generated images can also introduce artifacts, omit sensor-specific noise, or create scenes that satisfy a prompt while failing to represent operational reality. The right business use is not “replace real-world testing.” It is “expand the scenario search space, then validate the generated cases against operational constraints.”

Fourth, explanations are not causality by default. A VLM-aligned concept map can suggest that a model’s behavior correlates with certain concepts, but that does not necessarily prove causal reliance. Concept-based debugging can guide investigation. It should not be oversold as a complete explanation of model decision logic.

Fifth, certification acceptance remains a separate institutional problem. The paper references standards and emerging guidance, but a proposed toolchain is not automatically a certifiable process. Regulators and certification bodies will need evidence about tool reliability, human oversight, failure modes, validation procedures, and how generated artifacts fit into accepted assurance arguments.

Finally, the framework is broad. That is a strength for an idea paper, but an implementation challenge for a product team. A real deployment would need decisions about requirement templates, supported logics, VLM choice, concept vocabularies, threshold governance, dataset management, monitoring latency, audit storage, and integration with existing safety engineering workflows. The diagram is clean. The procurement checklist will not be.

The obvious summary misses the paper’s real point

A shallow summary would say: the paper proposes using LLMs and VLMs to assure safety-critical AI systems. True, but not very helpful.

The better interpretation is that the paper proposes a semantic assurance pipeline. It uses LLMs to expose and discipline the meaning of requirements, then uses VLMs to reconnect perception behavior to those meanings through concepts, monitors, tests, and debugging artifacts.

That is why the mechanism-first reading matters. The contribution is not a single model, metric, or experiment. It is the proposed continuity from language to logic to perception inspection.

For business leaders, this changes the buying question. The question is not “Can we buy an AI safety inspector?” The question is “Where does our current assurance process lose semantic traceability?”

Possible answers:

Requirements are written in English but never formalized.
Formal tests exist but do not connect to perception concepts.
Datasets are measured by volume rather than operational semantic coverage.
Runtime monitoring catches technical anomalies but not requirement-level semantic violations.
Debugging tools show saliency-like artifacts but not concept-level failure modes.
Safety evidence exists, but each artifact lives in a different room, guarded by a different team, speaking a different dialect.

REACT and SemaLens are interesting because they target those failure points. Not perfectly. Not finally. But directly.

The practical takeaway: treat foundation models as inspection scaffolding

The title phrase “fighting AI with AI” is memorable, but it risks overselling the autonomy of the approach. A better business phrase would be “using foundation models as inspection scaffolding.”

Scaffolding is temporary, structured, and supportive. It lets human engineers reach parts of the system that are otherwise hard to inspect. It does not become the building. It does not certify the building. It helps people work on the building without falling off it.

That is the right mental model for this paper.

LLMs help expose ambiguity and connect requirements to formal artifacts. VLMs help project visual behavior into human concepts that can be monitored, tested, and debugged. Formal methods and human validation keep the process anchored. The result is not autonomous assurance. It is a more inspectable assurance workflow for systems where English requirements and neural perception currently pass each other in the hallway without making eye contact.

For safety-critical AI, that may be exactly the kind of unglamorous progress that matters.

Cognaptus: Automate the Present, Incubate the Future.

¹ Anastasia Mavridou, Divya Gopinath, and Corina S. Păsăreanu, “Fighting AI with AI: Leveraging Foundation Models for Assuring AI-Enabled Safety-Critical Systems,” arXiv:2511.20627, 2025.
