Road testing has one inconvenient flaw: reality insists on happening in real time.

That is a problem for autonomous vehicles, robots, drones, and other cyber-physical systems whose failures are rare, contextual, and often expensive to reproduce. Simulation helps because it lets engineers manufacture awkward situations on demand: the pedestrian who appears at the worst possible moment, the parked car blocking the lane, the unprotected turn that requires social judgement rather than just geometry. Lovely. Except simulation has its own embarrassing little issue: a failure in simulation may be a real system weakness, or it may be an artefact of synthetic sensor data wearing a lab coat.

The paper “Querying Labeled Time Series Data with Scenario Programs” tackles that gap from a refreshingly concrete angle [1]. Instead of asking whether simulation is realistic in some grand philosophical sense, it asks a narrower operational question: given a formal scenario program and a labeled real-world sensor trace, can we determine whether the real trace contains that scenario?

That sounds modest. It is not. It turns scenario descriptions from simulation scripts into database-like queries over real-world logs. For safety-critical AI teams drowning in sensor data and validation obligations, that is not a minor convenience. It is a possible indexing layer between imagined failures and recorded reality.

The important correction comes early: this is not a raw-video understanding system. It does not watch camera footage like a vision-language model and declare, with theatrical confidence, that “yes, a car seems to be yielding.” The method queries label traces: time-series labels that describe objects, positions, orientations, occupied lanes, and primitive behaviours. The raw sensor data matters only after the matching labeled segment has been found.

That distinction is not a footnote. It is the whole business case.

The useful object is not a video; it is a labeled trace with semantics

The authors start from a practical sim-to-real validation problem. Simulation-based testing can identify failure-inducing scenarios, but teams then need to know whether those scenarios occur in real-world data. The old-school answer is physical reconstruction: take the failure scenario to a track, rebuild it, test it, repeat, invoice someone. That scales about as well as artisanal manufacturing.

The paper proposes a different route. If a real-world dataset has labels rich enough to describe the relevant scene and behaviour, then the team can query the dataset for traces matching a formal scenario. The output is not merely “yes” or “no”; it is a subset of labeled real-world data, and therefore a pointer back to the corresponding sensor clips. Those clips can then be used to test perception, prediction, or planning modules.

The data representation matters. A label trace contains two broad kinds of information:

| Label-trace component | What it represents | Why it matters |
| --- | --- | --- |
| Input observations | Semantic state of the world, such as object positions, orientations, lanes, visibility, and object types | Lets the scenario program test whether the scene conditions are satisfied |
| Output behaviours | Primitive behaviours assigned to objects, such as following lane, lane changing, turning, braking, or being stationary | Lets the scenario program check whether observed actors behave consistently with the scenario |
| Sets of possible behaviours | Multiple plausible behaviours at a timestep, often derived from classifier confidence thresholds | Lets uncertainty in behaviour classification enter the query instead of pretending labels are magically perfect |

That final row is where the method becomes more serious than a brittle rule filter. If a behaviour classifier says an object might be performing several primitive behaviours above a threshold, the label trace can record a set. Matching then asks whether the program and the label trace have at least one consistent behavioural interpretation over a window.
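
To make the idea of a behaviour set concrete, here is a minimal Python sketch of one labeled object at one timestep. The class name, field names, scores, and the 0.3 threshold are all assumptions made for illustration; the paper's actual label schema may look quite different.

```python
from dataclasses import dataclass


# Hypothetical structure for illustration; not the paper's label schema.
@dataclass(frozen=True)
class ObjectLabel:
    object_type: str                # e.g. "car" or "pedestrian"
    position: tuple[float, float]   # input observation: where the object is
    behaviours: frozenset[str]      # output: every plausible primitive behaviour


def behaviours_above_threshold(scores: dict[str, float],
                               threshold: float) -> frozenset[str]:
    """Keep every primitive behaviour whose classifier score meets the threshold."""
    return frozenset(name for name, score in scores.items() if score >= threshold)


# Assumed classifier scores for one car at one timestep.
scores = {"follow_lane": 0.55, "lane_change": 0.40, "turn_left": 0.05}
label = ObjectLabel("car", (12.0, 3.5), behaviours_above_threshold(scores, 0.3))
print(sorted(label.behaviours))
```

With these assumed scores the label keeps both `follow_lane` and `lane_change`, so a later matching step is free to pick whichever one the scenario program can actually produce.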

This is also where the boundary appears. The method can be formally correct with respect to the labels and still be practically wrong if the labels are wrong. Reality, as usual, refuses to be solved by notation alone.

Scenic becomes a query language, not just a scenario generator

The paper uses Scenic as the scenario language. Scenic is a probabilistic programming language for specifying scenes and behaviours in simulation. A program can define objects, sample initial positions, specify visibility constraints, and describe behaviours that react to conditions.

A running example in the paper describes a car following a lane and changing lanes when it gets close to a stationary car in front. In Scenic, that can be represented as a probabilistic behaviour: follow the lane by default, but interrupt that behaviour and initiate a lane change when the distance condition is triggered. Because thresholds may be sampled from a distribution, the same observed situation may allow more than one feasible behaviour.

The authors’ move is to reinterpret the Scenic program as a formal query. A real label trace matches a Scenic program if two things hold:

  1. The initial observation in some window is within the support of the program’s initial scene distribution.
  2. The observed behavioural trace over that window is one the program could generate, given the observed inputs.

There is an additional wrinkle: object correspondence. The Scenic program may say ego and otherCar, while the dataset may contain car17, car29, a pedestrian, and a delivery truck minding its own business. The algorithm must find an injective mapping from program objects to observed objects. Extra objects in the dataset are allowed. Extra required objects in the program are not.
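
The correspondence requirement can be sketched by brute force. The paper describes an SMT-based search; the enumeration below swaps that for plain `itertools.permutations`, purely to show what "injective, type-respecting, extra trace objects allowed" means. The function name and the example object identifiers are hypothetical.

```python
from itertools import permutations


def candidate_correspondences(program_objects: dict[str, str],
                              trace_objects: dict[str, str]):
    """Yield injective mappings from program object names to trace object ids.

    Both arguments map an object name to its type. A mapping is kept only
    if every program object is paired with a trace object of the same type.
    """
    names = list(program_objects)
    # permutations() never reuses an element, so every mapping is injective.
    # If the program needs more objects than the trace has, nothing is yielded.
    for chosen in permutations(trace_objects, len(names)):
        if all(program_objects[n] == trace_objects[t] for n, t in zip(names, chosen)):
            yield dict(zip(names, chosen))


program = {"ego": "car", "otherCar": "car"}
trace = {"car17": "car", "car29": "car", "ped3": "pedestrian", "truck8": "truck"}
mappings = list(candidate_correspondences(program, trace))
for mapping in mappings:
    print(mapping)
```

Here only two mappings survive: the two cars can fill the two roles in either order, while the pedestrian and the truck are ignored rather than causing the query to fail.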

This is a good design choice. Real traffic scenes are messy. A query for “car yields to another car before turning right” should not fail because a pedestrian exists three lanes away, bravely contributing nothing to the scenario.

The automata layer is where the paper earns its keep

This article takes a mechanism-first view because the mechanism is the contribution. The headline is not “we beat GPT-4o on a small task,” tempting though that may be for the nearest slide deck. The deeper idea is that a formal scenario can be compiled into a machine that checks labeled traces with correctness guarantees.

The authors translate a supported fragment of Scenic into synchronous hierarchical finite state machines, or HFSMs. Each object’s behaviour becomes an HFSM. These HFSMs run together, synchronously, over the time steps of the label trace.

The supported Scenic fragment includes distributions such as Uniform, Range, Normal, and TruncatedNormal; behaviour statements such as do, do until, and try / interrupt; and a range of position, orientation, scalar, boolean, vector, and region operators. The restriction is not arbitrary bureaucracy. The supported fragment keeps guard evaluation dependent on the current input rather than on unbounded history. That makes the automata-based matching procedure tractable and allows the correctness argument to go through.

Guards are checked using SMT solving. If a guard contains an unobserved program variable, such as a sampled threshold from Range(1,15), the method asks whether some value in that domain makes the guard true. Suppose a car is 10 metres from another car and the interrupt condition is “distance is less than a sampled threshold between 1 and 15 metres.” There exists a threshold that makes the condition true, and there also exists a threshold that does not. The machine therefore preserves nondeterminism: more than one behaviour may remain feasible.
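
The 10-metre example can be worked through without a solver. The sketch below replaces the SMT query with simple interval reasoning for the single guard `distance < threshold`, assuming the threshold is drawn from a closed interval; the function name is hypothetical and this is not the paper's implementation.

```python
def guard_outcomes(distance: float, low: float, high: float) -> set[bool]:
    """Possible truth values of `distance < threshold` for threshold in [low, high]."""
    outcomes = set()
    if distance < high:    # choosing threshold = high makes the guard true
        outcomes.add(True)
    if distance >= low:    # choosing threshold = low makes the guard false
        outcomes.add(False)
    return outcomes


# Threshold assumed to be sampled from Range(1, 15), as in the example above.
print(guard_outcomes(10.0, 1.0, 15.0))   # both outcomes are feasible
print(guard_outcomes(20.0, 1.0, 15.0))   # no threshold in range exceeds 20
print(guard_outcomes(0.5, 1.0, 15.0))    # every threshold in range exceeds 0.5
```

A general guard mixes several variables and operators, which is where an SMT solver earns its place; the interval trick only works because this guard has one unknown and one comparison.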

That nondeterminism is not a bug. It is the formal version of “the scenario program permits several possible behavioural paths here.” The query algorithm carries forward all compatible HFSM states and prunes the ones inconsistent with the observed output labels.
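
A rough sketch of "carry forward, then prune" follows. It uses a flat two-state machine rather than a hierarchical one, hard-codes the lane-change example, and checks the guard independently at each step instead of tracking one consistent sampled threshold across the whole window. All names are hypothetical, and those simplifications are assumptions of this sketch, not properties of the paper's algorithm.

```python
THRESHOLD_RANGE = (1.0, 15.0)  # assumed Range(1, 15) for the sampled threshold


def successors(state: str, distance: float) -> set[tuple[str, str]]:
    """Return feasible (next_state, emitted_behaviour) pairs for one timestep."""
    low, high = THRESHOLD_RANGE
    if state == "changing":
        # Simplification: once the lane change starts, it continues.
        return {("changing", "lane_change")}
    options = set()
    if distance < high:    # some threshold makes `distance < threshold` true
        options.add(("changing", "lane_change"))
    if distance >= low:    # some threshold makes it false
        options.add(("following", "follow_lane"))
    return options


def feasible_states(trace: list[tuple[float, set[str]]]) -> set[str]:
    """Carry forward every state consistent with the observed behaviour sets."""
    current = {"following"}
    for distance, observed in trace:
        current = {
            nxt
            for state in current
            for nxt, behaviour in successors(state, distance)
            if behaviour in observed   # prune paths the labels contradict
        }
        if not current:
            break                      # no interpretation survives
    return current


# Each entry: (distance to the stationary car, observed behaviour set).
window = [
    (30.0, {"follow_lane"}),
    (10.0, {"follow_lane", "lane_change"}),
    (8.0, {"lane_change"}),
]
print(feasible_states(window))
```

In the middle step both states survive because the label set is ambiguous and the guard could go either way; the final label then prunes the path that kept following the lane. An empty result would mean the window does not match.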

The matching process has three main steps:

| Step | Mechanism | Operational consequence |
| --- | --- | --- |
| Compile the program | Parse the Scenic program and translate behaviours into HFSMs | The scenario becomes executable as a formal recogniser |
| Search correspondences | Use SMT constraints over object type and observed duration to map program objects to label-trace objects | The system can handle scenes with extra objects and unknown object identities |
| Slide over time | Check each length-$m$ window for initial-scene consistency and behaviour-trace consistency | The query can find a scenario inside a longer trace rather than requiring it to start at frame zero |

The window length $m$ is not decorative. If it is too short, the system may return clips that match for a trivial moment but do not meaningfully contain the scenario. If it is too long, valid partial occurrences may be missed. In business terms, $m$ is a policy knob: it controls how strict the retrieval system is about temporal persistence.
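
The effect of $m$ is easy to see in a toy sliding-window loop. In the sketch below, `window_matches` stands in for the full initial-scene and behaviour check, and the predicate used in the example is invented for illustration; none of these names come from the paper.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def matching_windows(trace: Sequence[T], m: int,
                     window_matches: Callable[[Sequence[T]], bool]) -> list[int]:
    """Return the start index of every length-m window accepted by the predicate."""
    if m <= 0:
        raise ValueError("window length m must be positive")
    # range() is empty when the trace is shorter than m, so no window is tried.
    return [start for start in range(len(trace) - m + 1)
            if window_matches(trace[start:start + m])]


def follow_then_change(window: Sequence[str]) -> bool:
    """Toy stand-in for a real matcher: starts following, ends changing lanes."""
    return window[0] == "follow" and window[-1] == "change"


labels = ["follow", "follow", "change", "change", "follow"]
print(matching_windows(labels, 2, follow_then_change))
print(matching_windows(labels, 3, follow_then_change))
print(matching_windows(labels, 6, follow_then_change))
```

On this five-step trace, $m = 2$ and $m = 3$ both find the manoeuvre but anchor it at different start frames, while $m = 6$ finds nothing because no window of that length exists. Choosing $m$ is therefore a retrieval decision, not a rendering detail.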

The authors state a correctness theorem: for a Scenic program, label trace, and integer window length, the algorithm returns true if and only if the label trace matches the program for a window of that length. That guarantee is meaningful, but it is scoped. It applies to the formal problem, the supported Scenic fragment, and the given labels. It does not say the labels were produced correctly, nor that the scenario captures all safety-relevant aspects of the world. Formal methods are precise. They are not clairvoyant.

The experiment is small, but it tests the right pressure points

The experiments are easy to misread. They do not prove that the method is ready to replace a company’s validation pipeline. They test two claims that matter for the proposed role of the method:

  1. Can it query relevant labeled data accurately enough to be useful?
  2. Does runtime scale in a way that makes the approach plausible?

The authors compare their query algorithm against GPT-4o and Claude 3.5 on four driving scenarios using a subset of nuScenes videos. The scenarios include lane changing to avoid a stationary car, yielding during an unprotected left turn, passing a pedestrian, and yielding before making a right turn. For each scenario, they prepare five videos, after manually identifying matching clips from a larger pool and adding non-matching clips.

The result table is compact but revealing:

| Scenario | Claude accuracy | GPT-4o accuracy | Query algorithm accuracy |
| --- | --- | --- | --- |
| Lane change around stationary car | 0.4 | 0.2 | 1.0 |
| Unprotected left turn after yielding | 0.2 | 0.6 | 0.6 |
| Passing pedestrian in lane | 0.6 | 0.8 | 1.0 |
| Yield before right turn | 0.6 | 0.8 | 0.6 |
| Average | 0.45 | 0.60 | 0.80 |
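
For readers who like to check the bottom row, the averages can be recomputed from the per-scenario figures. The snippet only restates numbers from the table; the variable names are arbitrary.

```python
from statistics import mean

# Per-scenario accuracies copied from the table, in scenario order 1 to 4.
accuracy = {
    "Claude": [0.4, 0.2, 0.6, 0.6],
    "GPT-4o": [0.2, 0.6, 0.8, 0.8],
    "Query algorithm": [1.0, 0.6, 1.0, 0.6],
}

# Round to two decimals to match the precision reported in the table.
averages = {method: round(mean(scores), 2) for method, scores in accuracy.items()}
for method, value in averages.items():
    print(f"{method}: {value:.2f}")
```

With five videos per scenario, each 0.2 step in the table corresponds to a single clip, so the headline gaps rest on a handful of videos.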

The runtime comparison is more dramatic: the paper reports average runtime of roughly 6.33 seconds for Claude, 41.19 seconds for GPT-4o, and 0.06 seconds for the algorithm. That is the kind of number that makes procurement departments briefly sit up, before legal asks seventeen follow-up questions.

But the accuracy result deserves careful interpretation. The algorithm does not win because it “understands video” better. It wins because it is not trying to understand raw video. It uses structured labels already available in the dataset, while the VLMs must inspect RGB video and answer a complex natural-language query. That is not an unfair comparison; it is a comparison between two possible retrieval interfaces. But the interface assumptions are different.

The authors are admirably clear about the failure cases. The algorithm performs perfectly on scenarios 1 and 3, but drops to 0.6 on scenarios 2 and 4. The issue is not the formal matcher hallucinating. It is label quality. In one case, a vehicle executing an unprotected left turn has primitive behaviours in the label trace that unrealistically switch from left turn to right turn. The matcher can only work with what the labels say. Garbage in, formally verified garbage out. Very elegant garbage, but still.

The appendix is reproducibility plumbing, not a second thesis

The appendix material matters because it clarifies what is evidence, what is implementation detail, and what is merely there to prevent future arguments in review meetings.

| Paper element | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Accuracy experiment against GPT-4o and Claude | Main evidence and comparison with a flexible natural-language baseline | Structured scenario queries can be faster and more accurate on the tested labeled-video retrieval task | General superiority over VLMs for all video understanding tasks |
| Four Scenic programs in the appendix | Implementation detail and reproducibility support | The scenarios were manually encoded as formal programs, not inferred from vague prose | That all safety scenarios are easy to encode |
| VLM prompt and scenario descriptions | Implementation detail for comparison | The VLMs were given explicit instructions, including object correspondence requirements | That prompt engineering has been exhausted |
| Scalability with trace duration | Sensitivity test for temporal length | Runtime scales approximately linearly with duration when object correspondence is small | Scalability to crowded scenes |
| Scalability with number of objects | Boundary test | Correspondence search becomes the bottleneck; eight objects time out at 10 seconds | That the method is unsuitable for all multi-object settings |

This table is more than neat article furniture. It prevents the common interpretive mistake: treating every figure as equally central. The VLM comparison is evidence for a practical retrieval advantage. The scalability plots identify where the method behaves well and where it starts coughing. The appendix programs and prompts explain how the experiment was conducted; they are not independent validation miracles.

The scalability story is half comforting, half rude

The second experiment studies scalability. For duration, the authors generate synthetic label traces from 20 to 100 timesteps using Scenic and CARLA, then query them. With only two agents in the program, runtime scales approximately linearly with trace length. That is the comforting half. If the query has a fixed number of objects, searching longer traces mostly means sliding across more windows and evaluating more time steps.

The rude half is object count. The authors increase the number of objects from 2 to 8 by replicating objects and behaviours in the Scenic program. Runtime grows exponentially and the eight-object query times out at 10 seconds. The reason is correspondence search. If the program has eight objects and the trace has eight plausible objects of the same relevant type, the worst-case number of object correspondences is $8! = 40{,}320$.
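
The growth is easy to tabulate. Assuming the worst case in which every trace object has the right type for every program role, the number of injective mappings from $k$ program objects into $n$ trace objects is $n!/(n-k)!$, which `math.perm` computes directly. The function name is hypothetical, and the count is an upper bound on candidates rather than a measurement of the paper's solver.

```python
import math


def worst_case_correspondences(program_objects: int, trace_objects: int) -> int:
    """Count injective mappings when every trace object could fill every role."""
    # math.perm(n, k) = n! / (n - k)!, and it returns 0 when k > n.
    return math.perm(trace_objects, program_objects)


# Eight same-type objects in the trace, with a growing number of program objects.
for k in range(2, 9):
    print(k, worst_case_correspondences(k, 8))
```

Two program objects against eight candidates give 56 mappings to consider; eight against eight give 40,320. Type filters or tracking constraints shrink the candidate pool, which is why they matter far more than a faster inner loop.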

That does not make the work impractical. It makes it honest. Many valuable safety queries involve a small number of central actors: ego vehicle, one car, one pedestrian, maybe a second interacting vehicle. For those, the method could be useful. Dense urban scenes with many plausible object mappings will need smarter pruning, domain constraints, tracking continuity, role inference, or learned candidate ranking. The authors explicitly identify improving correspondence search and supporting a larger Scenic fragment as future work.

This is the part where a bad product pitch would say “the system scales linearly.” It does not. It scales linearly with duration under small correspondence sets, and exponentially with object count under the tested correspondence process. That distinction is the difference between an engineering roadmap and a brochure.

The business value is evidence indexing, not certification theatre

For companies building autonomous systems, robotics platforms, or safety-critical AI stacks, the practical pathway is clear:

  1. Use simulation and scenario modeling to identify candidate failure scenarios.
  2. Encode those scenarios as Scenic programs.
  3. Query labeled real-world datasets for matching traces.
  4. Replay or evaluate system components on the retrieved sensor segments.
  5. Use missing matches as signals about dataset coverage.

That last point is underrated. If no matching real-world traces are returned, the conclusion is not “the risk does not exist.” It is “this scenario is not represented in the indexed labeled dataset under this query definition.” For dataset governance, that is valuable. It tells teams where their evidence base is thin.
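
As a sketch of what such a coverage report might look like, the snippet below runs a small library of queries over a set of label traces and lists the scenarios with no hits. The matcher is reduced to a plain predicate, and every name, query, and trace here is invented for illustration.

```python
from typing import Callable, Sequence

Trace = Sequence[str]  # toy label trace: one primitive behaviour per timestep


def coverage_report(queries: dict[str, Callable[[Trace], bool]],
                    traces: dict[str, Trace]) -> tuple[dict[str, list[str]], list[str]]:
    """Return matching trace ids per query, plus the queries with no match."""
    hits = {
        name: [trace_id for trace_id, trace in traces.items() if matches(trace)]
        for name, matches in queries.items()
    }
    gaps = sorted(name for name, found in hits.items() if not found)
    return hits, gaps


# Hypothetical queries standing in for compiled scenario programs.
queries = {
    "lane_change": lambda t: "lane_change" in t,
    "yield_then_turn_right": lambda t: "yield" in t and "turn_right" in t,
}
traces = {
    "log_001": ["follow_lane", "lane_change", "follow_lane"],
    "log_002": ["follow_lane", "brake", "stationary"],
}
hits, gaps = coverage_report(queries, traces)
print(hits)
print(gaps)
```

An entry in `gaps` says only that the indexed data contains no match under this query definition. It is a prompt to collect or label more data, not evidence that the scenario never happens.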

The paper’s contribution can be translated into business terms this way:

| Technical contribution | Business interpretation | Boundary |
| --- | --- | --- |
| Formal definition of matching between Scenic programs and label traces | Turns simulation scenarios into auditable retrieval criteria | Depends on labels that capture the relevant semantics |
| HFSM compilation with SMT guard checking | Makes scenario matching executable and formally scoped | Limited to the supported Scenic fragment |
| Sliding-window query over traces | Finds scenarios embedded inside longer logs | Requires careful choice of match duration |
| VLM comparison on nuScenes subsets | Shows structured labels can beat raw-video natural-language querying on the tested task | Small experiment; not a universal claim about VLMs |
| Scalability analysis | Identifies duration as manageable and object correspondence as the bottleneck | Crowded scenes remain hard |

The likely buyer inside an autonomy organisation is not the person looking for a shiny demo. It is the validation lead asking a dull but expensive question: “Have we seen this kind of failure in our real data?” This method makes that question less mystical.

It also changes how simulation findings might be managed. A failure discovered in simulation can become a reusable query object. That query can be run against new data releases, new regions, new weather subsets, or new label versions. The organisation gains a regression-testing pattern: not just “run the model on benchmarks,” but “continually ask whether the evidence base contains the behavioural situations our simulator says matter.”

The ROI, if it exists, is not that the algorithm removes track testing. It does not. The ROI is cheaper triage: fewer blind searches through logs, faster construction of targeted replay sets, better visibility into dataset gaps, and a more disciplined bridge between formal scenario design and empirical validation.

The labels are the contract, and the contract can be wrong

The most dangerous misconception is that this method validates real-world safety directly. It does not. It validates whether a labeled trace matches a formal scenario. That is a narrower and more useful claim.

Three boundaries matter.

First, label fidelity is decisive. The experiment’s own failures in scenarios 2 and 4 come from primitive-behaviour label errors. If behaviour classifiers misclassify turns, braking, lane following, or yielding, the query result inherits that error. Better label pipelines improve the method; weak label pipelines turn it into a fast way to retrieve nonsense.

Second, scenario expressiveness is scoped by the supported Scenic fragment. The method supports meaningful sequential and interrupt-driven behaviours, but it does not cover arbitrary Scenic programs or arbitrary histories. That is not a defect; it is the price of a tractable formal algorithm. Still, product teams must not quietly expand the marketing language from “supported fragment” to “all scenarios.” That way lies demo-driven archaeology.

Third, object correspondence is the computational choke point. The approach is promising for small-agent scenarios and less ready for scenes where many objects could plausibly fill the same role. In practice, teams would likely combine this method with object-type filters, map context, spatial constraints, tracker identities, or learned prefilters to reduce the correspondence search space before the formal matcher runs.

These boundaries do not weaken the paper. They make it usable. A tool with a clear contract is easier to integrate than a magical video oracle that performs beautifully until somebody asks it to explain itself.

The real contribution is a safer interface between simulation and logs

The paper’s best idea is not that automata are faster than VLMs. Of course they are, when the world has already been turned into symbolic labels. The best idea is that simulation scenarios can become queries over real-world evidence.

That reframes sim-to-real validation. Instead of treating simulation and real data as separate kingdoms (one expressive but synthetic, the other authentic but unstructured), the paper proposes a formal interface between them. Scenic describes the scenario. Labels describe the observed world. HFSMs and SMT checks decide whether the two are consistent over time.

For business readers, the implication is blunt: the next layer of autonomy validation may look less like watching endless video and more like maintaining a library of formal behavioural queries. That library will not certify safety by itself. It will not replace track testing. It will not rescue poor labels from themselves. But it can make the evidence workflow less theatrical and more repeatable.

And in safety-critical AI, repeatability is not glamour. It is infrastructure. Which is usually where the money eventually goes, once the demo budget has finished making everyone feel futuristic.

Cognaptus: Automate the Present, Incubate the Future.


  1. Edward Kim, Devan Shanker, Varun Bharadwaj, Hongbeen Park, Jinkyu Kim, Hazem Torfah, Daniel J. Fremont, and Sanjit A. Seshia, “Querying Labeled Time Series Data with Scenario Programs,” arXiv:2511.10627, 2025, https://arxiv.org/abs/2511.10627