Mapping the Unknown: Turning AI Safety from Space into Proof

Proof sounds like a courtroom word. In safety-critical AI, it is more like warehouse management.

First, define the space. Then label the shelves. Then check what is actually on them. Then find the empty slots. Then fill them deliberately rather than hoping the next random delivery truck brings exactly what the regulator asked for. Not glamorous. Also not optional.

That is the practical charm of “From High-Dimensional Spaces to Verifiable ODD Coverage for Safety-Critical AI-based Systems.”¹ The paper is not selling a more heroic neural network. It is not claiming that simulation alone can certify AI. It is doing something more useful and less theatrical: it turns the Operational Design Domain, or ODD, into a coverage object that engineers can inspect, measure, and iterate.

For a regulated AI system, that distinction matters. A model can perform well on a test set and still leave parts of the real operating envelope untouched. In aviation, autonomous mobility, medical devices, industrial robotics, and critical infrastructure, “we tested a lot” is not the same as “we know which operational regions remain untested.” The former is a comforting sentence. The latter is evidence.

The paper’s central contribution is therefore a workflow. It takes an abstract regulatory demand—show that the AI/ML constituent has sufficiently complete and representative coverage of its ODD—and turns it into a sequence of engineering operations: discretize parameters, reduce dimensions where justified, filter infeasible or less critical combinations, account for dependencies, test joint coverage, and generate new scenarios for uncovered regions.

The important word is joint. Testing each parameter separately is easy to explain and easy to overtrust. Coverage of altitude, speed, time-to-conflict, weather, or advisory state one by one does not imply coverage of their combinations. Reality, being impolite, tends to arrive as combinations.
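A toy illustration makes the gap between per-parameter and joint coverage concrete. The sketch below assumes a hypothetical two-parameter ODD with three bins each; the bin labels, scenarios, and function names are invented for illustration and do not come from the paper.

```python
from itertools import product

# Hypothetical two-parameter ODD, three bins each (labels are illustrative only).
altitude_bins = ["low", "mid", "high"]
speed_bins = ["slow", "medium", "fast"]

# Executed scenarios, already mapped to bins. Note the pattern: only the diagonal.
scenarios = [("low", "slow"), ("mid", "medium"), ("high", "fast")]

def marginal_coverage(samples, index, bins):
    """Share of one parameter's bins hit by at least one scenario."""
    seen = {s[index] for s in samples}
    return len(seen & set(bins)) / len(bins)

def joint_coverage(samples, *bin_lists):
    """Share of bin combinations hit by at least one scenario."""
    all_combos = set(product(*bin_lists))
    return len(set(samples) & all_combos) / len(all_combos)

alt_cov = marginal_coverage(scenarios, 0, altitude_bins)
spd_cov = marginal_coverage(scenarios, 1, speed_bins)
joint_cov = joint_coverage(scenarios, altitude_bins, speed_bins)

print(f"altitude: {alt_cov:.0%}, speed: {spd_cov:.0%}, joint: {joint_cov:.0%}")
# prints: altitude: 100%, speed: 100%, joint: 33%
```

Every altitude bin and every speed bin has been visited, yet six of the nine combinations have not.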

The safety problem is not a missing metric; it is unvisited territory

The ODD is supposed to define where a system is intended to operate. In a simple world, that might be a neat list of conditions. In a real system, it becomes a multi-dimensional space: relative altitude, vertical rates, time to loss of separation, previous system advisory, sensor conditions, environmental factors, and so on.

Once those dimensions interact, the number of possible states grows quickly. This is the ordinary curse of dimensionality, now wearing a certification badge.

The paper positions itself against three familiar partial solutions.

Existing approach	What it helps with	What it does not solve
Simulation sampling	Produces many scenarios at scale	Does not prove that important regions were covered
Geometry-based coverage	Gives visible boundaries around sampled data	Can miss internal holes and density gaps
Statistical dependency methods	Capture relationships among parameters	Can become expensive or difficult in larger spaces

The authors’ answer is not to discard these tools. It is to embed them inside a process that asks a stricter question: after we define the relevant ODD, which joint bin combinations have actually been exercised?

That question is more annoying than an aggregate accuracy score. Naturally, it is also more useful.

The paper’s real object is a verification loop

The mechanism begins with a defined ODD and a database of executed scenarios. Continuous parameters cannot be exhaustively tested as continuous values, so the method first converts them into bins. This is the move from infinite smooth space to finite accountable space.

But binning is not a clerical detail. It controls the resolution of the safety argument. If bins are too large, safety-relevant gaps disappear inside broad categories. If bins are too small, the coverage problem explodes and the project becomes a shrine to computational suffering.

The paper’s answer is criticality-aware discretization. Parameters, or regions within parameters, can receive different bin resolutions depending on how safety-critical they are. Higher criticality deserves finer resolution. Lower criticality can be treated more coarsely. This is not a mathematical luxury. It is how an engineering team avoids spending equal test effort on situations that do not carry equal safety meaning.
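One way to picture criticality-aware discretization is a list of non-uniform bin edges. The sketch below is a minimal example under stated assumptions: the edges, the choice of which region counts as more critical, and the helper names `bin_index` and `bin_width` are all hypothetical and are not the paper's bins.

```python
from bisect import bisect_right

# Hypothetical bin edges for time to loss of horizontal separation (seconds).
# Assumption for illustration: short horizons are treated as more critical,
# so they get 2 s bins, while the rest of the 0-60 s range gets 10 s bins.
TAU_EDGES = [0, 2, 4, 6, 8, 10, 20, 30, 40, 50, 60]

def bin_index(value, edges):
    """Return the index of the half-open bin [edges[i], edges[i+1]) holding value.

    The top edge is treated as inclusive so the range maximum still gets a bin.
    """
    if not edges[0] <= value <= edges[-1]:
        raise ValueError(f"{value} is outside the defined range")
    if value == edges[-1]:
        return len(edges) - 2
    return bisect_right(edges, value) - 1

def bin_width(index, edges):
    """Width of bin `index` in the same unit as the edges."""
    return edges[index + 1] - edges[index]

for tau in (1.5, 9.9, 35.0, 60.0):
    i = bin_index(tau, TAU_EDGES)
    print(f"tau={tau} s -> bin {i} (width {bin_width(i, TAU_EDGES)} s)")
```

The same 60-second range yields ten bins here instead of thirty uniform 2-second bins, with the resolution concentrated where the assumed criticality is highest.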

After binning, the method checks whether parameter grouping is justified. Related variables can sometimes be merged into a higher-level condition. The paper gives the intuitive example of precipitation type and precipitation intensity becoming a broader precipitation condition. The benefit is dimensionality reduction. The danger is that grouping can hide meaningful distinctions. The authors explicitly warn that grouping should be validated, for example through sensitivity analysis or temporary refinement of grouped ranges, to ensure it does not erase safety-critical behavior.

Then comes constraint definition. This is where the method becomes useful rather than merely tidy. Not every combination of parameter values is physically plausible, operationally relevant, or equally critical. If a combination cannot happen, or if it corresponds to a lower-risk region, it should not automatically consume the same verification budget as a realistic high-risk encounter.

A less disciplined organization might call this “pruning.” A regulator will want to know why the pruning is legitimate. The paper’s workflow forces that justification into the method: constraints must be explicit, domain-grounded, and inspectable.
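"Explicit and inspectable" can be read quite literally: each exclusion carries a name and a written rationale, and every removed combination is logged against the rule that removed it. The following is a sketch of that idea only. The `Constraint` class, the toy precipitation and runway parameters, and the rule itself are assumptions made for illustration, not structures from the paper.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Tuple

Combo = Tuple[str, str]  # (precipitation, runway_state): hypothetical parameters

@dataclass(frozen=True)
class Constraint:
    name: str
    rationale: str                  # the justification a reviewer will read
    keeps: Callable[[Combo], bool]  # True if the combination stays relevant

# Illustrative rule only; a real exclusion needs domain evidence behind it.
CONSTRAINTS = [
    Constraint(
        name="no_dry_runway_in_heavy_rain",
        rationale="Assumed physically implausible for this toy ODD.",
        keeps=lambda c: not (c[0] == "heavy_rain" and c[1] == "dry"),
    ),
]

def apply_constraints(combos, constraints):
    """Split combos into kept ones and an audit log of (combo, constraint name)."""
    kept, excluded = [], []
    for combo in combos:
        failed = [k.name for k in constraints if not k.keeps(combo)]
        if failed:
            excluded.append((combo, failed[0]))
        else:
            kept.append(combo)
    return kept, excluded

all_combos = list(product(["none", "light_rain", "heavy_rain"], ["dry", "wet"]))
relevant, audit_log = apply_constraints(all_combos, CONSTRAINTS)
print(len(all_combos), len(relevant), audit_log)
```

The audit log is the part a reviewer cares about: it answers "why is this combination not in the denominator?" without anyone having to reverse-engineer a filter.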

Dependency modeling is the next layer. ODD parameters are rarely independent. Weather variables, aircraft states, sensor conditions, traffic geometry, and human-system context can co-vary. If a coverage process assumes independence everywhere, it may waste effort on unrealistic combinations while missing the structure of realistic ones. The paper discusses advanced dependency modeling, including vine copulas, as one way to represent such relationships in larger or more stochastic ODDs. In the specific VerticalCAS case study, however, the authors do not use complex dependency modeling because the selected state variables are governed mainly by deterministic kinematic constraints and the dimensionality is relatively low.

That matters for interpretation. The paper is not claiming that one small aviation case proves a general dependency model. It is showing where dependency modeling belongs in the assurance workflow.

The coverage formula is simple because the difficulty is upstream

Once parameters are discretized, the relevant space becomes a Cartesian product of bins:

$$ B = B_1 \times B_2 \times \cdots \times B_n $$

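A bin combination is covered if at least one executed scenario contains a data point within that combination’s parameter ranges. Coverage is then measured over the relevant combinations, meaning those that remain after infeasible or less critical ones have been filtered out: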

$$ r_{cov} = \frac{|B_{relevant, covered}|}{|B_{relevant}|} $$

For the completeness requirement, the target is uncompromising:

$$ r_{cov} = 1 $$

This is why the method is not just another sampling metric. It produces a list of missing bin combinations. Those missing combinations can be translated back into physical scenario parameters, executed, added to the database, and tested again. The workflow is therefore iterative: define the space, measure gaps, generate scenarios, retest.

That loop is the practical contribution. A coverage deficit is not merely a bad score. It becomes an engineering worklist.
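The measure-then-generate step can be sketched in a few lines. The example below assumes hypothetical bin edges for two parameters and translates each missing combination back into physical values by taking bin midpoints. The midpoint choice, the edges, and the function names are assumptions for illustration; the paper does not prescribe them here.

```python
from bisect import bisect_right
from itertools import product

# Hypothetical bin edges for two parameters (not the paper's actual bins).
EDGES = {
    "rel_altitude_ft": [-1500, -500, 500, 1500],
    "tau_s": [0, 20, 40, 60],
}
NAMES = list(EDGES)

def to_bin(value, edges):
    """Index of the bin containing value; the top edge is inclusive.

    Assumes value lies within [edges[0], edges[-1]].
    """
    return min(bisect_right(edges, value) - 1, len(edges) - 2)

def coverage_report(rows, relevant):
    """Return (r_cov, sorted missing combinations) for the relevant space."""
    covered = {tuple(to_bin(r[n], EDGES[n]) for n in NAMES) for r in rows}
    covered &= relevant
    missing = sorted(relevant - covered)
    return len(covered) / len(relevant), missing

def midpoint_scenario(combo):
    """Translate a bin combination back into one candidate parameter set."""
    return {
        n: (EDGES[n][i] + EDGES[n][i + 1]) / 2
        for n, i in zip(NAMES, combo)
    }

relevant = set(product(range(3), range(3)))  # no constraints in this toy case
rows = [
    {"rel_altitude_ft": -900, "tau_s": 5},
    {"rel_altitude_ft": 0, "tau_s": 25},
    {"rel_altitude_ft": 0, "tau_s": 31},     # same bin combination as above
    {"rel_altitude_ft": 1200, "tau_s": 55},
]

r_cov, missing = coverage_report(rows, relevant)
backlog = [midpoint_scenario(c) for c in missing]
print(f"r_cov = {r_cov:.2f}, {len(missing)} combinations still to generate")
```

Four rows cover only three combinations because two of them land in the same bin. Executing the backlog, appending the results to `rows`, and calling `coverage_report` again is the loop.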

The VerticalCAS case study is a process demonstration, not a victory lap

The paper applies the method to VerticalCAS, an AI-based vertical collision avoidance setting. The ODD in the demonstration is built from state variables including relative intruder altitude, ownship vertical rate, intruder vertical rate, time to loss of horizontal separation, and the previous advisory indicator.

The authors use 1.97 million rows of previously gathered VerticalCAS state-variable data. This sounds large, and it is large in the everyday sense. In coverage terms, however, “large” is an unhelpful adjective unless we know what space it was supposed to cover.

The paper’s state-space construction shows why. Relative altitude is binned across a range from -1500 to 1500 feet. Ownship and intruder vertical rates are considered across -3200 to 3200 ft/min. Time to loss of horizontal separation is binned over 0 to 60 seconds. The previous advisory variable originally has nine possible values, but the authors abstract it to a single bin because, for this demonstration, they focus on current encounter geometry rather than advisory history.

The paper reports that this simplification reduces the adjusted state space to 195,200 combinations. Later, the coverage table reports 192,000 unconstrained combinations, and the reported 3.36% coverage aligns with the 192,000 denominator. That small internal inconsistency is worth noticing, but it does not change this article’s business interpretation: the denominator is large enough that a database of nearly two million rows can still cover only a small fraction of the joint operational space.

The case study then applies two constraints.

First, the allowable relative-altitude envelope depends on time to loss of horizontal separation. For very small time horizons, only smaller relative altitudes are likely to represent critical encounters. For larger time horizons, the full altitude range becomes relevant. The paper models this using a logarithmic envelope:

$$ h_{max}(\tau) = \alpha \log(\tau + 1) + 300 $$

where:

$$ \alpha = \frac{1200}{\log(61)} $$

Second, the method removes vertically diverging encounters. If the ownship and intruder are moving in a way that increases vertical separation, those configurations are less critical for this coverage demonstration and are excluded from the constrained space.
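Both constraints are small enough to write down as predicates. The sketch below implements the envelope formula as given and pairs it with one plausible reading of "vertically diverging". That reading, the sign convention for relative altitude, the point-wise (rather than bin-wise) evaluation, and all function names are assumptions for illustration; the paper's exact definitions may differ.

```python
import math

# The ratio log(tau + 1) / log(61) is the same in any log base,
# so using the natural log here does not change the envelope.
ALPHA = 1200 / math.log(61)

def h_max(tau_s):
    """Maximum relevant |relative altitude| in feet for a given tau in seconds."""
    return ALPHA * math.log(tau_s + 1) + 300

def inside_envelope(rel_altitude_ft, tau_s):
    return abs(rel_altitude_ft) <= h_max(tau_s)

def is_vertically_diverging(rel_altitude_ft, own_rate_fpm, intruder_rate_fpm):
    """Assumed definition: the vertical gap is growing.

    rel_altitude_ft is taken as intruder altitude minus ownship altitude, so
    its rate of change is intruder_rate - own_rate. The encounter diverges when
    that rate has the same sign as the current offset. A zero offset or zero
    relative rate is kept as non-diverging.
    """
    relative_rate = intruder_rate_fpm - own_rate_fpm
    return rel_altitude_ft * relative_rate > 0

def is_relevant(rel_altitude_ft, own_rate_fpm, intruder_rate_fpm, tau_s):
    """True if a state passes both constraints and stays in the coverage space."""
    return inside_envelope(rel_altitude_ft, tau_s) and not is_vertically_diverging(
        rel_altitude_ft, own_rate_fpm, intruder_rate_fpm
    )

print(round(h_max(0)), round(h_max(60)))  # envelope endpoints in feet
```

The endpoints are a quick sanity check on the formula: the envelope starts at 300 ft when tau is 0 and reaches the full 1500 ft range at tau of 60 seconds.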

Together, these constraints reduce the considered space to 78,688 combinations. That is a reduction of about 59.7% when measured against the 195,200-combination figure, or about 59.0% when measured against the 192,000 combinations used in the coverage table.

This is not cheating. Or at least, it is not cheating if the constraints are explicit, defensible, and reviewed. It is the difference between refusing to test irrelevant combinations and silently failing to test relevant ones. The former is engineering judgment. The latter is a future incident report with prettier formatting.

The low coverage numbers are the point

Here is the part that readers can easily misread.

Metric	Unconstrained	Constrained
Combinations	192,000	78,688
Covered combinations	6,455	2,062
Coverage	3.36%	2.62%

At first glance, applying constraints appears to make coverage worse: 3.36% becomes 2.62%. That is the wrong conclusion.

The existing scenarios were not designed to demonstrate ODD completeness. They came from prior research data. So the low coverage is not evidence that the workflow failed. It is evidence that ordinary scenario collections—even large ones—should not be mistaken for completeness evidence.

The constrained coverage is lower because the method has redefined the denominator to focus on physically meaningful and safety-relevant combinations. It also identifies exactly which combinations remain uncovered. That is not a regression. It is a diagnosis.
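The arithmetic behind the table is worth doing once by hand. The sketch below uses only the four figures from the coverage table; the one added assumption, stated in the comments, is that the constrained space is a subset of the unconstrained one under the same binning.

```python
# Figures as reported in the coverage table above.
unconstrained_total, unconstrained_covered = 192_000, 6_455
constrained_total, constrained_covered = 78_688, 2_062

cov_unconstrained = unconstrained_covered / unconstrained_total
cov_constrained = constrained_covered / constrained_total

# Assumption: the constrained space is a subset of the unconstrained one, so
# the difference in covered counts is coverage that landed in excluded regions.
covered_but_excluded = unconstrained_covered - constrained_covered
share_excluded = covered_but_excluded / unconstrained_covered

# Relevant combinations with no scenario yet: the scenario-generation backlog.
backlog = constrained_total - constrained_covered

print(f"{cov_unconstrained:.2%} -> {cov_constrained:.2%}")
print(f"{covered_but_excluded} of {unconstrained_covered} covered combinations "
      f"({share_excluded:.1%}) sit outside the constrained space")
print(f"{backlog} relevant combinations remain uncovered")
```

Under that assumption, 4,393 of the 6,455 covered combinations (about 68%) fall in regions the constraints exclude, which is why the percentage drops. The remaining 76,626 uncovered relevant combinations are the worklist.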

This is the paper’s most useful message for business readers: a coverage method does not primarily make your test set look better. It makes your ignorance legible.

There is an uncomfortable maturity test here. Many organizations prefer metrics that make existing assets look impressive. A serious assurance workflow may do the opposite. It may take a large archive of simulation data and say: good start, but here are the untested regions, here is why they matter, and here is the scenario-generation backlog.

That is operationally valuable. It is also politically inconvenient. Such is life outside demo day.

How to read the paper’s evidence without overreading it

The paper contains a method diagram, parameter table, constraint visualization, and coverage report. They do not all play the same evidentiary role.

Paper element	Likely purpose	What it supports	What it does not prove
Process flow diagram	Main method description	The workflow can be organized as an iterative safety-assurance loop	That the loop is sufficient for every regulated AI system
VerticalCAS state-variable table	Implementation detail for the case study	The ODD can be operationalized through concrete variables and bins	That these exact bins are optimal or transferable
Constraint-envelope heatmap	Demonstration of constraint logic	Domain constraints can reduce the relevant coverage space and reveal white-space gaps	That the chosen envelope is the only correct one
Coverage table	Main case-study evidence	Existing data covered only a small share of joint bin combinations, and constraints reduced the remaining search space	That complete coverage was achieved

There are no elaborate ablation studies here. There is no robustness sweep over bin sizes, criticality functions, alternative constraints, or dependency models. The paper itself recognizes that the VerticalCAS case uses deliberate simplifications, including uniform criticality and no complex dependency modeling.

That boundary should not be treated as a fatal flaw. It tells us what the paper is: a proof-of-process, not a finished certification recipe.

The business value is not “more testing”; it is cheaper diagnosis

For companies building safety-critical AI, the immediate temptation is to read this paper as a testing paper. That is only partly right.

The deeper business implication is that ODD coverage creates a management layer between regulatory language and engineering execution.

A compliance team can say: “We need complete ODD coverage.” An engineering team can reply: “Show us the coverage grid, the constraints, the uncovered combinations, and the scenario-generation queue.” The conversation becomes less ceremonial. Everyone suffers less. Miracles do happen.

The workflow also changes resource allocation. Instead of adding more scenarios randomly, teams can generate scenarios for missing combinations. Test effort becomes targeted. Simulation capacity, labeling capacity, hardware-in-the-loop time, and expert review time can be allocated to gaps with explicit safety relevance.

For enterprise AI governance, this suggests a useful distinction:

Governance artifact	Weak version	Stronger version enabled by this method
ODD definition	List of operating conditions	Discretized parameter space with explicit bin logic
Test evidence	Dataset size and aggregate performance	Joint coverage map over relevant combinations
Scenario generation	Random or convenience-driven sampling	Gap-driven scenario creation
Audit trail	Narrative assurance claim	Reproducible coverage report and missing-region list
Risk prioritization	Expert intuition only	Expert judgment encoded into criticality, constraints, and dependencies

This is especially relevant beyond aviation. Autonomous driving, industrial inspection, medical imaging, warehouse robotics, and AI-assisted control systems all face versions of the same problem: the operating environment is not a single distribution but a structured space of conditions. Safety evidence has to respect that structure.

Cognaptus would infer one practical design principle from the paper: treat coverage logic as part of the product architecture, not as an after-the-fact audit spreadsheet. If the ODD, binning rules, constraints, and scenario-generation pipeline are defined early, the safety case can evolve with the system. If they are assembled at the end, the team may discover that its impressive dataset is impressively irrelevant in precisely the wrong places.

Where this method can mislead if used lazily

The method’s strength is that it makes assumptions explicit. Its weakness is also that it depends on those assumptions.

The first boundary is bin design. Finer bins produce better resolution but more combinations. Coarser bins produce easier coverage but may hide safety-relevant gaps. A team can make coverage look better by choosing bins that are too broad. That is not assurance. That is spreadsheet cosmetics.
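The cosmetic effect is easy to reproduce. In the sketch below, the same hypothetical samples are scored against coarse and fine equal-width bins; the sample values and the `coverage_1d` helper are invented for illustration.

```python
def coverage_1d(values, low, high, n_bins):
    """Share of n_bins equal-width bins on [low, high] hit by at least one value."""
    width = (high - low) / n_bins
    hit = {min(int((v - low) / width), n_bins - 1) for v in values}
    return len(hit) / n_bins

# Hypothetical tau samples (seconds), with nothing between 10 s and 30 s.
taus = [1, 3, 4, 8, 33, 41, 47, 52, 58]

coarse = coverage_1d(taus, 0, 60, 3)   # 20 s bins
fine = coverage_1d(taus, 0, 60, 12)    # 5 s bins
print(f"3 bins: {coarse:.0%}, 12 bins: {fine:.0%}")
# prints: 3 bins: 100%, 12 bins: 58%
```

Nothing about the data changed between the two numbers. The 20-second gap simply disappears inside the coarse bins.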

The second boundary is criticality modeling. In the VerticalCAS demonstration, the paper assumes uniform criticality for simplicity. In a real safety case, criticality may vary sharply across operational regions. For example, time-to-conflict, relative motion, sensor degradation, and advisory timing may not deserve equal treatment across their ranges. If criticality is poorly defined, the coverage structure will direct effort poorly.

The third boundary is constraint legitimacy. Removing infeasible or less critical combinations is sensible, but every exclusion should be traceable to physics, domain rules, operational assumptions, or validated expert judgment. Otherwise, constraint definition becomes a convenient machine for deleting inconvenient test obligations.

The fourth boundary is dependency modeling. The paper’s case study does not need vine copulas because the selected VerticalCAS variables are relatively low-dimensional and governed by deterministic constraints. Larger ODDs may involve stochastic dependencies among environment, sensors, human operators, traffic patterns, and system states. In those cases, dependency modeling becomes central rather than optional.

Finally, coverage is not performance. A bin combination can be covered by one data point, but that does not prove the model behaves safely throughout the bin. The paper’s coverage metric helps identify whether regions have been exercised. It does not, by itself, validate model stability, robustness, or correct behavior across every value inside a bin. Those remain separate assurance tasks.
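A simple companion check is to count how many samples back each covered combination, so that thinly covered regions are at least visible. This is a sketch, not part of the paper's metric: the bin tuples and the `MIN_SAMPLES` threshold are assumed values chosen for illustration.

```python
from collections import Counter

# Hypothetical executed scenarios, already mapped to bin combinations
# (altitude_bin, tau_bin). Values are illustrative only.
hits = [(0, 0), (0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (2, 0)]

MIN_SAMPLES = 2  # assumed review threshold, not a value from the paper

counts = Counter(hits)
thin = sorted(combo for combo, n in counts.items() if n < MIN_SAMPLES)

print(f"{len(counts)} combinations covered, {len(thin)} by fewer than "
      f"{MIN_SAMPLES} samples: {thin}")
```

A count still says nothing about whether the model behaved correctly on those samples. It only separates "exercised once" from "exercised repeatedly".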

This is not a complaint. It is simply the price of using the method correctly.

From safety theater to safety accounting

The most useful safety frameworks are often boring in exactly the right way. They turn broad promises into artifacts that can be checked. They force teams to state what counts, what is excluded, what remains missing, and what must happen next.

This paper moves AI safety assurance in that direction. It does not solve certification. It does not make high-dimensional systems easy. It does not prove that every future AI-based aviation system can be certified through this exact workflow.

What it does is more modest and more valuable: it gives engineers a way to transform an ODD from a conceptual boundary into a measurable coverage structure.

The case study’s low coverage numbers should therefore be read as a warning against lazy confidence. A large scenario archive can still leave most joint operational regions untouched. A constrained ODD can reduce the search space while making the remaining gaps sharper. And a serious safety process should welcome that discomfort because it turns uncertainty into work.

For business leaders, the lesson is plain: regulated AI will not be governed by slogans about trust. It will be governed by maps, constraints, coverage reports, and scenario backlogs.

A little less theater. A little more accounting. Annoying, yes. But aircraft, factories, hospitals, and autonomous systems tend to prefer annoying evidence over beautiful vibes.

Cognaptus: Automate the Present, Incubate the Future.

¹ Thomas Stefani, Johann Maximilian Christensen, Elena Hoemann, Frank Köster, and Sven Hallerbach, “From High-Dimensional Spaces to Verifiable ODD Coverage for Safety-Critical AI-based Systems,” arXiv:2604.02198, 2026, https://arxiv.org/pdf/2604.02198.
