Do They Mean It? Testing Whether AI Actually ‘Reasons’ Behind the Wheel

A car follows a cyclist on a narrow road. The double solid yellow line says: do not cross. The empty oncoming lane says: perhaps you can. The cyclist may feel uncomfortable being followed. The passenger may be late. The vehicle behind may be getting impatient. The automated vehicle must choose.

A normal benchmark would ask whether the final maneuver is safe, legal, smooth, or close to a human reference trajectory. Useful, yes. Complete, no.

The more uncomfortable question is whether the AI system responded to the right reasons. Did it stay behind because it understood legal restraint? Did it overtake because the safety margin, cyclist comfort, and traffic flow jointly justified an exception? Or did it simply produce a plausible decision and then decorate it with a tidy explanation after the fact, as language models so often do when asked to sound thoughtful?

That is the problem behind CARE-Drive, a framework proposed by Lucas Elbert Suryana and colleagues for evaluating reason-responsiveness in vision-language models used for automated driving.¹ The paper is not just another “LLMs can drive now” exercise, thankfully. It is more useful than that. It asks how we can test whether a model’s driving decision changes when human-relevant reasons and contextual variables change.

That distinction matters. In safety-critical AI, a good explanation is cheap. A system can say “safety first” with the solemn confidence of a compliance slide. The harder test is whether its decisions actually move when the safety margin, social pressure, or efficiency trade-off changes.

CARE-Drive is valuable because it turns that philosophical concern into an operational audit.

The core test: do decisions move when reasons move?

The paper starts from a familiar weakness in AI evaluation. Many driving benchmarks focus on outcome quality: collision rate, trajectory accuracy, rule compliance, or maneuver smoothness. These are necessary. No one wants a beautifully reasoned crash.

But outcome metrics leave a different question unanswered: whether the decision was made for human-relevant reasons. In an ambiguous driving scene, two actions may both be physically safe, but one may better reflect human judgment because it balances legality, cyclist comfort, traffic pressure, and efficiency.

CARE-Drive uses the concept of Meaningful Human Control, especially the “tracking” condition: automated systems should track the human-relevant reasons that justify decisions. The framework does not require opening the model’s internal representations. Instead, it evaluates observable behavior.

The mechanism is simple enough to state, but demanding enough to be useful:

Hold the driving task and scene representation stable.
Compare a baseline model decision with a reason-augmented decision.
Calibrate prompts so results are not just prompt noise.
Systematically vary context variables tied to human-relevant reasons.
Measure whether the model’s decision probability changes in interpretable ways.

In other words, do not ask the model, “Please explain your driving decision.” Ask instead: “When the reason changes, does your decision change too?”

That is a much better audit question.
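The steps listed above can be sketched as a small perturbation loop. This is a minimal illustration, not the authors' implementation: `toy_model` is a hypothetical stand-in for a VLM call, and the context keys `ttc` and `vehicle_behind` are assumed names chosen for the example.

```python
import random
from typing import Callable, Dict

Context = Dict[str, float]


def overtake_rate(decide: Callable[[Context], str], context: Context, runs: int = 30) -> float:
    """Fraction of repeated runs in which the decision is 'overtake'."""
    hits = sum(decide(context) == "overtake" for _ in range(runs))
    return hits / runs


def sensitivity(decide: Callable[[Context], str], base: Context,
                variable: str, new_value: float, runs: int = 30) -> float:
    """Change in overtake rate when exactly one context variable is perturbed."""
    perturbed = dict(base, **{variable: new_value})
    return overtake_rate(decide, perturbed, runs) - overtake_rate(decide, base, runs)


def toy_model(context: Context) -> str:
    """Hypothetical stand-in for a VLM: reacts to TTC, ignores everything else."""
    p_overtake = 0.1 + 0.7 * context["ttc"]  # ttc assumed normalized to [0, 1]
    return "overtake" if random.random() < p_overtake else "stay"


random.seed(0)
base = {"ttc": 0.0, "vehicle_behind": 0.0}
delta_ttc = sensitivity(toy_model, base, "ttc", 1.0, runs=300)
delta_rear = sensitivity(toy_model, base, "vehicle_behind", 1.0, runs=300)
print(f"TTC shift: {delta_ttc:+.2f}, rear-vehicle shift: {delta_rear:+.2f}")
```

The toy model moves when TTC moves and stays flat when the rear vehicle appears, which is the kind of asymmetry the audit is designed to expose.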

CARE-Drive separates prompt instability from reason-responsiveness

The most important design choice in the paper is the two-stage structure. Without it, the experiment would be easy to misread.

Stage 1 is prompt calibration. The authors vary model choice and reasoning style to identify a configuration that gives stable, expert-aligned decisions under a fixed overtaking scenario. This step is not the main evidence about contextual sensitivity. It is a filter. The authors are trying to avoid a common trap in LLM evaluation: mistaking random prompt behavior for meaningful reasoning.

Stage 2 is contextual evaluation. Once the prompt configuration is fixed, the authors vary observable context variables such as time-to-collision with an oncoming vehicle, whether another vehicle is behind the AV, passenger urgency, following time behind the cyclist, and explanation-length constraints. They then estimate how these variables change the probability that the model chooses to overtake.

This is the right order. First remove obvious instability. Then test sensitivity.

CARE-Drive component	What it does	Why it matters for business evaluation
Baseline prompt	Asks the VLM to decide without explicit human reasons	Reveals the model’s default decision tendency
Reason-augmented prompt	Adds structured normative reasons such as safety, legality, efficiency, comfort, fairness, and social appropriateness	Tests whether explicit human-centered guidance changes behavior
Prompt calibration	Screens model and reasoning-strategy combinations across repeated runs	Reduces the risk of treating stochastic prompt variation as “reasoning”
Context perturbation	Varies safety margin, rear-vehicle pressure, urgency, following time, and explanation length	Measures whether decisions track normatively meaningful context
Logistic analysis	Estimates effect direction and magnitude	Converts “seems responsive” into interpretable evidence

The mechanism-first view matters because the cyclist case is only the demonstration. The reusable idea is the audit pattern.

A bank could test whether a credit model responds to income stability and debt burden rather than protected-class proxies. A hospital could test whether a triage assistant responds to clinical urgency rather than documentation style. A legal AI tool could test whether recommendations move when precedent strength changes rather than when irrelevant rhetorical framing changes.

The domain changes. The audit logic survives.

The overtaking scenario is not a toy, even if it looks simple

The paper’s case study uses an automated vehicle deciding whether to overtake a cyclist. The scene is deliberately ambiguous. The road has double solid yellow lines, so crossing into the opposite lane is legally prohibited. At the same time, human drivers may sometimes overtake when the opposite lane is clear, following time is prolonged, and remaining behind creates discomfort or inefficient traffic flow.

This is not a “spot the stop sign” problem. It is a trade-off problem.

The model must choose between:

staying behind the cyclist, preserving strict rule compliance and avoiding risk;
overtaking, potentially improving efficiency and reducing prolonged following discomfort, but violating the road marking.

The authors use human-relevant reasons derived from prior expert work: safety, rule compliance, efficiency, comfort, environmental impact, social appropriateness, fairness, cultural adaptation, acceptance, interaction, vigilance and readiness, continuous control, and control transition. These reasons are included as a structured normative policy. Importantly, they do not directly tell the model which action experts prefer. They specify considerations, not the answer.

That detail matters. If the prompt simply said “experts recommend overtaking,” the experiment would be less interesting. It would test obedience, not reason-responsiveness.

Baseline models chose rule compliance every time

The first major finding is blunt.

Across the tested baseline conditions, with no explicit human reasons in the prompt, every tested model chose to stay behind the cyclist in every run. This held across the tested GPT-based VLMs and prompt structures. The models defaulted to strict legal compliance.

That is not irrational. A system staying behind a cyclist on a double-solid-line road is not obviously “wrong.” But the expert reference decision in the paper’s calibrated scenario favored overtaking under certain conditions, because the scene involved a trade-off among safety, comfort, legality, and efficiency.

Once structured human reasons were added, overtaking behavior appeared. The results varied by model and reasoning strategy:

Prompt configuration	Model	Overtake count out of 30
Role only	GPT-4.1	0
Role only	GPT-4.1-mini	0
Role only	GPT-4.1-nano	0
Role + human reasons	GPT-4.1	20
Role + human reasons	GPT-4.1-mini	15
Role + human reasons	GPT-4.1-nano	0
Role + human reasons + CoT	GPT-4.1	30
Role + human reasons + CoT	GPT-4.1-mini	28
Role + human reasons + CoT	GPT-4.1-nano	8
Role + human reasons + ToT	GPT-4.1	30
Role + human reasons + ToT	GPT-4.1-mini	25
Role + human reasons + ToT	GPT-4.1-nano	11

This is the paper’s first useful signal: explicit normative guidance changed the model’s decision behavior. The model did not merely produce a longer explanation while keeping the same action. In the calibrated setting, reasons moved decisions.

But there is a catch, as there usually is when AI appears to have learned wisdom overnight. The effect depended strongly on model capacity and reasoning structure. GPT-4.1 with Chain-of-Thought or Tree-of-Thought reached perfect overtaking alignment in the high-tension calibration scenario. Smaller models were less reliable.

For business users, this is already a useful warning. “We added reasoning instructions” is not an implementation guarantee. The same prompt pattern can behave very differently across model variants. The invoice may be smaller with a cheaper model, but so may be the part where the model notices the point.
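With only 30 runs per cell, the counts in the table above carry real sampling uncertainty. The sketch below attaches a Wilson score interval to each count; the interval method is an illustrative choice made here and is not something the paper is stated to use.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (about 95% for z = 1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Overtake counts out of 30, copied from the table above.
counts = {
    ("Role only", "GPT-4.1"): 0,
    ("Role only", "GPT-4.1-mini"): 0,
    ("Role only", "GPT-4.1-nano"): 0,
    ("Role + human reasons", "GPT-4.1"): 20,
    ("Role + human reasons", "GPT-4.1-mini"): 15,
    ("Role + human reasons", "GPT-4.1-nano"): 0,
    ("Role + human reasons + CoT", "GPT-4.1"): 30,
    ("Role + human reasons + CoT", "GPT-4.1-mini"): 28,
    ("Role + human reasons + CoT", "GPT-4.1-nano"): 8,
    ("Role + human reasons + ToT", "GPT-4.1"): 30,
    ("Role + human reasons + ToT", "GPT-4.1-mini"): 25,
    ("Role + human reasons + ToT", "GPT-4.1-nano"): 11,
}

for (prompt, model), k in counts.items():
    low, high = wilson_interval(k, 30)
    print(f"{prompt:28s} {model:13s} {k / 30:.2f}  [{low:.2f}, {high:.2f}]")
```

Even a 30-out-of-30 cell leaves a lower bound well short of 100%, which is worth remembering before calling any configuration "perfect."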

Tree-of-Thought won calibration, but calibration is not the final claim

After the first screening, the authors compared GPT-4.1 with Chain-of-Thought and Tree-of-Thought across three scenario types and two explanation-length regimes. The purpose of this test is best understood as robustness calibration, not as a separate thesis about Tree-of-Thought being universally superior.

The three scenarios were: no oncoming vehicle, oncoming vehicle, and a vehicle behind the AV. Scenario 1 and Scenario 3 shared the same dashboard image because the rear vehicle was not visible; that difference was encoded only in observable context. Scenario 2 included oncoming traffic and therefore made the overtaking decision more safety-critical.

The calibration results showed that Tree-of-Thought was generally more robust than Chain-of-Thought, especially under the unconstrained explanation setting. In the safety-critical oncoming scenario, Tree-of-Thought produced a 93.33% overtaking rate under no-limit explanation, while Chain-of-Thought produced 30%. Under few-sentence explanations, both dropped to 0% in the oncoming scenario.

That last point is not a footnote. It is a warning label.

Explanation length, or more precisely reasoning bandwidth, affected decisions sharply. When the model had less room to reason, it became more conservative. This does not mean “long explanations are always better.” It means that output constraints can alter behavior, not merely presentation.

For real-time AI systems, this is deeply inconvenient. Latency pressure often pushes teams toward shorter outputs and compact reasoning. CARE-Drive shows that this compression can change the decision distribution itself. The model is not a spreadsheet where shorter formatting leaves the calculation unchanged. Apparently, the wrapping can tug at the result. How charming.

Stage 2 asks which reasons the model actually tracks

The main evidence comes after calibration. The authors fix the model and prompt configuration, then run a full-factorial contextual sensitivity analysis using the oncoming-vehicle scenario as the visual baseline.

They vary five factors:

Variable	Normative role in the test	Expected intuition
Time-to-collision with oncoming vehicle	Safety margin	Larger TTC should make overtaking more acceptable
Vehicle behind AV	Social pressure / traffic flow	Rear pressure may increase overtaking tendency
Passenger urgency	Efficiency pressure	Urgency might increase overtaking tendency
Following time behind cyclist	Cyclist comfort / delay	Longer following may increase overtaking tendency
Explanation-length regime	Reasoning bandwidth and latency	Shorter explanations may reduce decision flexibility

Each condition is run 30 times. The outcome is binary: overtake or stay behind. The authors then estimate a logistic regression to quantify how each factor affects overtaking probability.
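A full-factorial design of this kind is easy to enumerate. The sketch below uses the five factor names from the table above, but the levels are hypothetical placeholders: the article does not list the levels used in the paper, so the resulting condition count is illustrative only.

```python
from itertools import product

# Factor names follow the table above; the LEVELS are hypothetical placeholders.
factors = {
    "ttc": ["short", "medium", "long"],
    "vehicle_behind": [False, True],
    "passenger_urgency": [False, True],
    "following_time": ["short", "long"],
    "explanation_limit": ["none", "few_sentences"],
}
RUNS_PER_CONDITION = 30  # stated in the text

# Every combination of levels becomes one experimental condition.
conditions = [dict(zip(factors, combo)) for combo in product(*factors.values())]
total_queries = len(conditions) * RUNS_PER_CONDITION

print(f"{len(conditions)} conditions, {total_queries} model queries")
```

The query budget grows multiplicatively with every added factor or level, which is one practical reason repeated runs per condition stay modest.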

The resulting pattern is the real contribution of the paper: the model is reason-responsive, but unevenly so.

Time-to-collision had the strongest positive effect. Moving from the minimum to maximum normalized TTC increased the odds of overtaking by a factor of 20.389. The predicted overtaking probability rose from a baseline of 12.4% to 74.3%.

The presence of a vehicle behind also increased overtaking likelihood. Its odds ratio was 3.781, and the predicted probability rose to 34.9%. That suggests the model responded to social driving context: a car waiting behind the AV makes prolonged following less neutral.

Passenger urgency did the opposite of what many readers might expect. Instead of making the model more likely to overtake, urgency reduced overtaking probability. The odds ratio was 0.418, with predicted probability falling to 5.6%.

Following time was not statistically significant. Increasing following time nudged the predicted overtaking probability from 12.4% down to 11.9%, but that change could not be distinguished from no effect.

Explanation constraint had a large negative effect. The odds ratio was 0.015, and the predicted probability dropped to about 0.2% under the constrained explanation condition.
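The odds ratios and predicted probabilities quoted above are two views of the same numbers. As a quick consistency check, the sketch below converts the 12.4% baseline to odds, multiplies by each reported odds ratio, and converts back. The function name is an assumption made for this example; the figures are the ones quoted in the text.

```python
def apply_odds_ratio(p_base: float, odds_ratio: float) -> float:
    """Scale the baseline odds by an odds ratio and convert back to a probability."""
    odds = p_base / (1 - p_base) * odds_ratio
    return odds / (1 + odds)


P_BASE = 0.124  # baseline overtaking probability quoted in the text

# (odds ratio, predicted probability) pairs as quoted in the text.
reported = {
    "higher TTC": (20.389, 0.743),
    "vehicle behind": (3.781, 0.349),
    "passenger urgency": (0.418, 0.056),
    "explanation constrained": (0.015, 0.002),
}

for name, (odds_ratio, p_reported) in reported.items():
    p = apply_odds_ratio(P_BASE, odds_ratio)
    print(f"{name:24s} computed {p:.3f}  reported {p_reported:.3f}")
```

Each computed value agrees with the reported probability to within rounding, so the two sets of figures are internally consistent.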

Predictor change	Direction of effect	Predicted overtaking probability	Interpretation
Baseline	—	12.4%	Low default overtaking tendency at the reference condition (minimum TTC, no rear vehicle, no urgency, unconstrained explanation)
Higher time-to-collision	Strongly positive	74.3%	Model tracks safety margin clearly
Vehicle behind present	Positive	34.9%	Model responds to rear-vehicle pressure
Passenger urgency present	Negative	5.6%	Model becomes more conservative, contrary to simple urgency intuition
Longer following time	Not significant	11.9%	Delay/cyclist-comfort cue is not independently tracked
Explanation constrained	Strongly negative	0.2%	Reasoning bandwidth affects decision behavior

This table is the article’s center of gravity. It shows why “the model reasons” is too crude.

The model tracked safety margin. It tracked rear-vehicle pressure. It did not meaningfully track following time. It responded to passenger urgency, but in the opposite direction from the expected efficiency-pressure story. It was also highly sensitive to explanation-length constraints.

That is not failure. It is diagnosis.
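The diagnosis above amounts to comparing each estimated effect with the direction the policy expected. A minimal sketch of that comparison follows. The labels "tracks", "ignores", and "reverses" and the `Effect` structure are assumptions made for this example; the significance flags reflect this article's reading of the results, and the following-time odds ratio is back-derived from the quoted probabilities rather than reported.

```python
from dataclasses import dataclass


@dataclass
class Effect:
    name: str
    expected_sign: int   # +1 if the factor was expected to raise overtaking
    odds_ratio: float
    significant: bool


def classify(effect: Effect) -> str:
    """Label a factor as tracked, ignored, or reversed relative to expectation."""
    if not effect.significant:
        return "ignores"
    observed_sign = 1 if effect.odds_ratio > 1 else -1
    return "tracks" if observed_sign == effect.expected_sign else "reverses"


# Back-derived from the 12.4% -> 11.9% probabilities; not a reported figure.
following_time_or = (0.119 / 0.881) / (0.124 / 0.876)

effects = [
    Effect("time_to_collision", +1, 20.389, True),
    Effect("vehicle_behind", +1, 3.781, True),
    Effect("passenger_urgency", +1, 0.418, True),
    Effect("following_time", +1, following_time_or, False),
]

for effect in effects:
    print(f"{effect.name:20s} {classify(effect)}")
```

The explanation-length factor is left out on purpose: it is an output constraint rather than a human-relevant reason, so "tracking" is not the right vocabulary for it.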

The model may be reasoning, but not the way your policy document imagines

A naïve interpretation would be: “The model responded to reasons, therefore it reasons like humans.”

No. Please do not do that. The procurement department has suffered enough.

The paper’s evidence is behavioral. It shows systematic associations between reason inputs, contextual variables, and decisions. It does not prove that the model internally represents normative reasons in the way a human driver, traffic psychologist, or safety engineer would.

This distinction is not academic hair-splitting. It affects how the framework should be used.

CARE-Drive can support claims such as:

“Under controlled prompt and scene conditions, this VLM’s decisions are sensitive to safety-margin changes.”
“Adding structured human reasons shifts decisions away from strict rule compliance toward expert-preferred overtaking in this scenario.”
“The model’s response to efficiency-related cues is weaker or counterintuitive.”
“Output-length constraints materially change the decision distribution.”

CARE-Drive should not support claims such as:

“The model understands human reasons.”
“The model’s explanations faithfully reveal its internal reasoning.”
“The model is generally safe for autonomous driving.”
“The model has solved meaningful human control.”

The paper is careful about this. It explicitly positions CARE-Drive as a behavioral evaluation framework. It does not claim direct access to internal causality. This is precisely why the framework is useful: it gives organizations a disciplined middle ground between blind trust in explanations and impossible demands for full mechanistic transparency.

The CARLA validation tests executability, not general driving safety

The paper also includes a CARLA simulator validation. The calibrated configuration is integrated into a simulated AV setup for selected scenarios. In the no-oncoming-vehicle scenario, the model consistently chooses to overtake. In the oncoming-vehicle scenario, it consistently chooses to stay behind. The maneuver is executed in simulation, with repeated queries used to reduce stochastic disagreement.

This is useful, but its role should be interpreted carefully.

The CARLA test shows that CARE-Drive’s decision output can be connected to executable vehicle behavior in a controlled simulation environment. It does not establish broad operational safety. It does not test the full distribution of road geometries, weather conditions, sensor failures, pedestrian interactions, or multi-agent negotiation. It is an implementation feasibility check, not a deployment license.

That is still worth having. A purely textual evaluation can become detached from physical action. The simulator bridge helps show that the evaluated decision can be operationalized. But the business reader should file this under “feasibility validation,” not “road-ready proof.”
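The article says only that repeated queries were used to reduce stochastic disagreement; it does not say how the answers were combined. One common option is a majority vote, sketched below as an assumption rather than as the paper's method. The `query` callable is a hypothetical stand-in for one model call.

```python
from collections import Counter
from typing import Callable


def majority_decision(query: Callable[[], str], repeats: int = 5) -> str:
    """Ask the model several times and return the most frequent decision.

    An odd number of repeats avoids ties between two possible answers.
    """
    votes = Counter(query() for _ in range(repeats))
    return votes.most_common(1)[0][0]


# Canned responses standing in for five stochastic model calls.
canned = iter(["stay", "overtake", "stay", "stay", "overtake"])
decision = majority_decision(lambda: next(canned), repeats=5)
print(decision)  # prints "stay"
```

Voting stabilizes the action passed to the simulator, but it also hides the underlying disagreement rate, so an audit would want to log the raw votes as well.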

What businesses should borrow from CARE-Drive

The immediate domain is automated driving, but the management lesson is broader. CARE-Drive gives a practical template for evaluating AI systems that make decisions in ambiguous, policy-sensitive environments.

The template is:

Step	Business translation
Define a decision with plausible competing actions	Choose cases where there is no trivial right answer
Identify human-relevant reasons	Translate policy, expert judgment, and stakeholder concerns into explicit reason categories
Create baseline and reason-augmented conditions	Test whether guidance changes decisions, not just explanations
Calibrate prompts or model settings	Reduce instability before drawing conclusions
Perturb context systematically	Vary the facts that should matter while holding irrelevant features fixed
Measure decision sensitivity	Use probabilities, odds ratios, or structured comparisons
Interpret uneven responsiveness	Identify which reasons the model tracks, ignores, or reverses

This matters for AI governance because many organizations currently treat explanation as evidence. A model says it considered safety, fairness, or compliance; the report screenshots the explanation; everyone nods; the risk committee moves on to sandwiches.

CARE-Drive suggests a better question: when safety, fairness, or compliance conditions change, does the decision change in the expected direction?

For AI assurance vendors, this is a product idea. For regulated enterprises, it is a testing discipline. For internal AI teams, it is a way to find out whether prompt policies are actually functional or merely decorative.

The business value is diagnostic evidence, not model certification

The strongest practical use of CARE-Drive is not to certify a model as “safe.” It is to diagnose how the model behaves under controlled reason variation.

That has several concrete business uses.

First, it can improve model selection. In the paper, model capacity mattered. GPT-4.1 responded more robustly than smaller variants. A company choosing between model tiers should not evaluate only cost, speed, and generic benchmark scores. It should test whether cheaper models preserve the reason-responsiveness needed for the task.

Second, it can improve prompt and policy design. The human-reason specification changed behavior, but the effect depended on reasoning strategy and explanation length. That means policies embedded in prompts should be treated as testable controls, not as magical governance dust sprinkled over the model.

Third, it can reveal blind spots. In the paper, safety margin and rear pressure mattered; following time did not; urgency made the model more conservative. For a real deployment, such asymmetry is exactly the point. You do not want to discover after launch that the model ignores the very context variable your policy team considered central.

Fourth, it can create evidence for audits. CARE-Drive-style tests produce structured records: prompt versions, context variables, repeated runs, decision probabilities, estimated sensitivities, and boundary conditions. This is more useful than a folder of charming explanations.

Finally, it can help separate “alignment theater” from working alignment. If a model’s explanations change but its decisions do not, you have a communication layer, not a control layer. Very nice for demos. Less nice for liability.
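The structured records mentioned in the fourth point can be as simple as one serializable row per tested condition. The sketch below is one possible shape, not a format taken from the paper: the field names and the `prompt_version` string are hypothetical, while the 30-out-of-30 count is the GPT-4.1 Tree-of-Thought cell from the calibration table.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class AuditRecord:
    prompt_version: str          # hypothetical identifier for the prompt text used
    model: str
    context: Dict[str, object]   # the context variables held or varied in this condition
    runs: int
    overtake_count: int

    @property
    def overtake_rate(self) -> float:
        return self.overtake_count / self.runs


record = AuditRecord(
    prompt_version="role-reasons-tot-v1",
    model="GPT-4.1",
    context={"scenario": "calibration"},
    runs=30,
    overtake_count=30,
)

# One JSON line per condition keeps the evidence diffable and easy to archive.
line = json.dumps({**asdict(record), "overtake_rate": record.overtake_rate}, sort_keys=True)
print(line)
```

Storing the prompt version next to the counts matters because, as noted earlier, the same policy wording can behave differently across models and reasoning strategies.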

Where the paper’s evidence stops

The paper’s limitations are not decorative. They define how the result should be used.

First, CARE-Drive evaluates observable behavior, not internal reasoning. It can show that decisions systematically change with reason-relevant variables. It cannot prove that the model internally represents or causally uses those reasons in a human-like way.

Second, the reason specification is prompt-based. Different formulations of the same normative principles may produce different results. That is not a small issue. In enterprise settings, policy wording often varies across departments with the serene inconsistency of committee work.

Third, the experiment focuses on one main driving scenario: overtaking a cyclist under legal and safety trade-offs. The framework is general, but the empirical findings are not automatically general across merging, yielding, pedestrian negotiation, emergency maneuvers, or dense urban traffic.

Fourth, the tested models are GPT-based VLMs available to the authors under their access and cost constraints. The framework can be applied to newer or different models, but the results should not be casually transferred.

Fifth, repeated runs per condition are limited. Thirty runs per condition is enough to reveal patterns in this study, but higher-stakes validation would require larger samples, more scenario diversity, and stricter statistical protocols.
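To put the thirty-run figure in perspective, the sketch below uses the standard normal approximation for a binomial proportion to estimate the margin of error at a given number of runs, and the runs needed for a target margin. The approximation and the target of five percentage points are illustrative choices made here, not requirements from the paper.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an observed proportion p over n runs."""
    return z * math.sqrt(p * (1 - p) / n)


def runs_needed(p: float, margin: float, z: float = 1.96) -> int:
    """Runs required so the approximate margin of error does not exceed `margin`."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)


# Worst case for a proportion is p = 0.5.
print(f"margin at 30 runs: {margin_of_error(0.5, 30):.3f}")
print(f"runs for a 0.05 margin: {runs_needed(0.5, 0.05)}")
```

At 30 runs a mid-range rate is uncertain by roughly 18 percentage points either way, which is fine for spotting large shifts and thin for certifying small ones.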

These boundaries do not weaken the article’s main point. They prevent the wrong point from being extracted.

CARE-Drive is not a certificate. It is a microscope.

The sharper question for safety-critical AI

The most useful sentence to take from this paper is not “VLMs can reason about driving.” That sentence is too broad, too tempting, and too likely to become a conference slide with a stock image of a highway.

The better sentence is:

A model’s explanations should be tested against its decision sensitivity.

If the safety margin increases and the model does not become more willing to overtake, something is off. If passenger urgency appears and the model becomes more conservative, that may be acceptable, but it needs interpretation. If longer following time is supposed to represent cyclist discomfort and traffic delay, but the model ignores it, the governance claim should be narrowed. If shorter explanations nearly eliminate overtaking, then latency optimization is not merely a performance tweak; it is a behavioral intervention.

That is the kind of evidence businesses need before placing AI into safety-critical workflows.

Not because every system must reason like a human. It may not. It probably does not. But if an organization claims that its AI follows human-relevant reasons, the claim should survive perturbation.

CARE-Drive gives us a way to ask: when the reasons change, does the decision follow?

That is a better question than “does the model sound responsible?”

Sounding responsible is easy. Even a chatbot can do it. Especially a chatbot.

Cognaptus: Automate the Present, Incubate the Future.

¹ Lucas Elbert Suryana, Farah Bierenga, Sanne van Buuren, Pepijn Kooij, Elsefien Tulleners, Federico Scari, Simeon Calvert, Bart van Arem, and Arkady Zgonnikov, “CARE-Drive: A Framework for Evaluating Reason-Responsiveness of Vision–Language Models in Automated Driving,” arXiv:2602.15645, 2026.
