Opening — Why this matters now

Agentic AI is rapidly escaping the sandbox.

From copilots to autonomous workflows, we are now deploying systems that don’t just predict — they act. The problem? These systems are increasingly embedded in real-world environments where timing, safety, and consistency are not optional.

And yet, the underlying models — particularly large language models — are inherently non-deterministic. Same input, different output. Slight latency shifts, different behaviors. In a chatbot, this is charming. In a car, it’s fatal.

The paper fileciteturn0file0 tackles this uncomfortable truth head-on: how do we make agentic AI systems behave predictably when their core components are fundamentally unpredictable?

Their answer is not to “fix” the AI — but to redesign the system around it.


Background — Context and prior art

Cyber-Physical Systems (CPS) — think autonomous vehicles, industrial robots, smart infrastructure — rely heavily on determinism.

Determinism, in this context, is simple but unforgiving:

Given the same inputs, the system must produce the same outputs.

Why? Because determinism enables:

| Capability | Why it matters |
|---|---|
| Repeatability | You can test and validate safety-critical behavior |
| Debuggability | Failures can be traced and reproduced |
| Composability | Systems can be reliably integrated |
| Certification | Regulators require predictable behavior |

Now introduce three sources of chaos:

  1. Human behavior (inconsistent, emotional)
  2. Physical environment (dynamic, stochastic)
  3. LLM-based agents (probabilistic, latency-variable)

You don’t get a system. You get a negotiation.

Previous approaches tried to improve parts of the system:

  • Better models (accuracy)
  • Fine-tuning (alignment)
  • Formal verification (bounded guarantees)

But they largely ignored a structural issue:

Even a perfect model cannot guarantee system-level determinism if the execution architecture is non-deterministic.


Analysis — What the paper actually does

The authors propose a subtle but powerful shift:

Treat nondeterminism as input, not error.

1. System Formalization

They define the system behavior as:

$$ y(t) = F(x_i, i_h(t), i_c(t), i_a(t)) $$

Where:

  • $x_i$: initial system state
  • $i_h(t)$: human input
  • $i_c(t)$: environment/car input
  • $i_a(t)$: agent (LLM) input

The key idea is almost philosophical:

If you treat all variability as explicit inputs, the system itself can remain deterministic.

This reframes the problem from eliminating randomness to containing it.
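The formalization can be sketched in a few lines. This is a toy stand-in for $F$ (the update rule and signal values are illustrative assumptions, not from the paper), showing the core point: once every source of variability is passed in as an explicit input, the system function itself is pure and reproducible.

```python
# Sketch of y(t) = F(x_i, i_h(t), i_c(t), i_a(t)): all variability enters
# as explicit inputs, so F itself stays a pure, deterministic function.
# The update rule and signal values are illustrative, not the paper's.

def F(x_i, i_h, i_c, i_a):
    """Toy system step: combine state with human, car, and agent inputs."""
    # Deterministic update; any randomness lives in the inputs, not here.
    return x_i + 0.5 * i_h - 0.2 * i_c + i_a

# Same input streams -> same output trace, every time.
inputs = [(1, 2, 0), (0, 1, 1)]
trace_1 = [F(0.0, h, c, a) for h, c, a in inputs]
trace_2 = [F(0.0, h, c, a) for h, c, a in inputs]
assert trace_1 == trace_2  # determinism holds once inputs are fixed
```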


2. Reactor Model of Computation (MoC)

Instead of loosely coupled services, the system is built using a reactor model, implemented via the Lingua Franca (LF) framework.

Core properties:

| Feature | Business translation |
|---|---|
| Deterministic scheduling | No race conditions between components |
| Port-based communication | Clear data contracts between modules |
| Logical time | Controlled timing instead of real-time chaos |
| Hierarchical composition | Systems remain explainable and auditable |

Think of it as replacing an improvisational jazz band with a tightly conducted orchestra.
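The essence of the reactor MoC can be sketched with an event queue ordered by logical time. This is a minimal illustration in plain Python, not the Lingua Franca API: reactions execute in an order that is a pure function of the event set, so no interleaving races are possible.

```python
import heapq

# Minimal reactor-style scheduler (a sketch, not the Lingua Franca API):
# reactions are ordered by (logical_time, priority), so execution order
# is a pure function of the scheduled events -- no race conditions.

class Scheduler:
    def __init__(self):
        self._queue = []   # heap of (logical_time, priority, name, payload)
        self.log = []      # executed reactions, in deterministic order

    def schedule(self, time, priority, name, payload):
        heapq.heappush(self._queue, (time, priority, name, payload))

    def run(self):
        while self._queue:
            time, _prio, name, payload = heapq.heappop(self._queue)
            self.log.append((time, name, payload))

s = Scheduler()
# Events inserted "out of order" still execute in logical-time order.
s.schedule(2, 0, "coach", "warn")
s.schedule(1, 0, "car", "speed=80")
s.schedule(1, 1, "driver", "brake")
s.run()
assert [name for _, name, _ in s.log] == ["car", "driver", "coach"]
```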


3. The Agentic Driving Coach (Case Study)

The system is decomposed into four reactors:

| Component | Role |
|---|---|
| Driver | Human behavior model |
| Car | Physical dynamics |
| Environment | External conditions |
| Coach | AI agent (LLM + planner) |

The Coach is where things get interesting:

  • LLM generates: CONTROL_SIGNAL | Instruction
  • Planner enforces modes: Monitoring → Warning → Actuate

This creates a controlled decision pipeline:

| Mode | Trigger | Action |
|---|---|---|
| Monitoring | Normal behavior | No intervention |
| Warning | Deviation | Suggest correction |
| Actuate | Safety breach | Override control |

This is not “AI autonomy.”

It’s AI under supervision with escalation protocols.
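The escalation ladder can be captured as a small decision function. The rule shapes follow the paper's Monitoring → Warning → Actuate modes; the exact thresholds here (other than the 25 m distance rule stated later in the text) are illustrative assumptions.

```python
# Sketch of the Monitoring -> Warning -> Actuate escalation.
# The 25 m distance rule is from the text; other thresholds are illustrative.

def coach_mode(distance_m, speed_kmh, limit_kmh):
    """Map observations to an intervention mode with hard rules."""
    if distance_m <= 25 and speed_kmh > limit_kmh:
        return "ACTUATE"      # safety breach: override control
    if speed_kmh > limit_kmh:
        return "WARNING"      # deviation: suggest correction
    return "MONITORING"       # normal behavior: no intervention

assert coach_mode(100, 50, 60) == "MONITORING"
assert coach_mode(100, 70, 60) == "WARNING"
assert coach_mode(20, 70, 60) == "ACTUATE"
```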


4. Containing LLM Uncertainty

The paper introduces three practical mechanisms:

a. Structured Prompting

Instead of free-form responses:


TOKEN | Message

With hard rules like:

  • If distance ≤ 25m and speed too high → ACTUATE
  • Else if deviation → WARNING

This reduces ambiguity and forces bounded outputs.
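A minimal sketch of what "bounded outputs" buys you: validate the `TOKEN | Message` shape before anything downstream sees it. The token names mirror the paper's modes; the parsing code and fallback choice are assumptions of this sketch.

```python
# Sketch: validate a structured "TOKEN | Message" LLM response so only
# bounded outputs reach the rest of the system. Token names follow the
# paper's modes; the parsing and fallback logic are illustrative.

ALLOWED_TOKENS = {"MONITORING", "WARNING", "ACTUATE"}

def parse_response(raw: str):
    """Return (token, message); reject free-form output by falling back."""
    token, sep, message = raw.partition("|")
    token = token.strip().upper()
    if not sep or token not in ALLOWED_TOKENS:
        return ("MONITORING", "")   # unparseable output is treated as no-op
    return (token, message.strip())

assert parse_response("WARNING | Reduce speed") == ("WARNING", "Reduce speed")
assert parse_response("Sure! Here's my advice...") == ("MONITORING", "")
```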

b. Deadline Enforcement

Each LLM call has a strict time budget:

| Model | Worst-case latency |
|---|---|
| 1B | 186 ms |
| 8B | 250 ms |
| 70B | 613 ms |

If the model is late, fallback logic triggers immediately.

This is critical:

A correct answer delivered late is equivalent to a wrong answer.
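Deadline enforcement can be sketched with a thread pool and a timeout. The `slow_llm` stub and the 50 ms budget are illustrative stand-ins; the point is that a late model answer never reaches the system, because the deterministic fallback answers first.

```python
import concurrent.futures
import time

# Sketch of deadline enforcement: the LLM call gets a strict time budget;
# if it misses the deadline, a deterministic fallback answers instead.
# slow_llm and the 50 ms budget are illustrative stand-ins.

def slow_llm(observation):
    time.sleep(0.2)               # simulate a 200 ms model call
    return "WARNING | Slow down"

def with_deadline(call, observation, budget_s, fallback):
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(call, observation)
        try:
            return future.result(timeout=budget_s)
        except concurrent.futures.TimeoutError:
            return fallback       # a late answer is treated as a wrong answer

result = with_deadline(slow_llm, "speed=80", budget_s=0.05,
                       fallback="ACTUATE | Brake")
assert result == "ACTUATE | Brake"   # model missed the deadline
```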

c. Logical Delays

Human reaction time (~500ms) and system delays are explicitly modeled.

This avoids the common fallacy of “instant AI decisions” in real-world systems.
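Making the delay explicit is almost trivially simple, which is the point: a warning issued at logical time $t$ is acted on at $t + 500\,\text{ms}$ by construction. The event-timeline framing here is an illustrative assumption; the 500 ms figure is from the text.

```python
# Sketch: model human reaction time as an explicit logical delay, so the
# timeline accounts for it instead of assuming instant responses.
# The 500 ms figure is from the text; the event model is illustrative.

HUMAN_REACTION_MS = 500

def reaction_time_ms(warning_time_ms):
    """A warning at t is acted on at t + reaction delay, by construction."""
    return warning_time_ms + HUMAN_REACTION_MS

assert reaction_time_ms(1_000) == 1_500  # warn at 1.0 s, driver acts at 1.5 s
```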


Findings — Results with structure

The experiments (see the figures on page 5 of the paper) reveal a non-obvious trade-off:

Model Size vs System Safety

| Model | Latency | Instruction quality | Outcome |
|---|---|---|---|
| 1B | Low | Poor | Unsafe (fails to stop) |
| 8B | Medium | Good | Acceptable |
| 70B | High | Best | Safest behavior |

Two insights emerge:

  1. Smaller models are faster but dangerously inaccurate
  2. Larger models are safer but introduce timing risk

Which leads to a design paradox:

You cannot optimize for both intelligence and responsiveness without architectural intervention.


Determinism Achieved (With a Catch)

The system produces identical outputs when:

  • Inputs are identical
  • Timing is controlled
  • LLM outputs are bounded

But note the fine print:

| Source of variability | How it's handled |
|---|---|
| LLM randomness | Temperature = 0 |
| Latency variation | Deadlines + fallback |
| Human behavior | Modeled as input stream |

This is not pure determinism.

It’s engineered determinism — a constrained sandbox where chaos is allowed, but only within guardrails.


Implications — What this means for business

This paper quietly challenges how most companies are deploying AI today.

1. Prompt Engineering is Not Enough

Most teams focus on improving outputs.

This work shows:

The real risk lies in when and how outputs are delivered.

System architecture > model quality.


2. Agentic Systems Need Operating Systems

What Lingua Franca represents is essentially:

An OS for agent coordination

Expect a shift from:

  • “LLM as a tool” → “LLM as a component in a deterministic pipeline”

3. Safety = Latency × Accuracy

Traditional AI metrics ignore timing.

This paper implies a more realistic objective:

| Metric | Interpretation |
|---|---|
| Accuracy | Is the decision correct? |
| Latency | Is it delivered in time? |
| Determinism | Is it repeatable? |

All three must hold simultaneously.
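The three-way conjunction can be written down directly. This predicate is an illustrative framing (not a metric from the paper); the latency figures plugged in below are the worst-case numbers from the table above.

```python
# Sketch: a decision counts as deployable only if it is correct, on time,
# and repeatable -- all three at once. The predicate is illustrative;
# the latencies below are the worst-case numbers quoted earlier.

def is_deployable(correct: bool, latency_ms: float, deadline_ms: float,
                  repeatable: bool) -> bool:
    return correct and latency_ms <= deadline_ms and repeatable

assert is_deployable(True, 186, 250, True)        # 1B-class: fast enough
assert not is_deployable(True, 613, 250, True)    # 70B-class: correct but late
assert not is_deployable(False, 186, 250, True)   # fast but wrong
```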


4. The Rise of Hybrid Control Systems

The architecture blends:

  • Rule-based systems (fallback)
  • Probabilistic models (LLMs)
  • Deterministic orchestration (reactors)

This hybrid approach is likely to dominate safety-critical AI deployments.

Pure AI systems won’t pass regulatory scrutiny.


Conclusion — Control is the new intelligence

The industry has been obsessed with making AI smarter.

This paper asks a more uncomfortable question:

What if intelligence is not the bottleneck — control is?

By reframing nondeterminism as an input and enforcing deterministic orchestration around it, the authors demonstrate a path forward for deploying agentic AI in real-world systems without gambling on unpredictability.

It’s less glamorous than scaling parameters.

But it’s what makes AI deployable.

And in the end, deployability beats brilliance.

Cognaptus: Automate the Present, Incubate the Future.