Opening — Why this matters now
LLM agents are no longer toys. They book flights, write emails, control vehicles, and increasingly operate in environments where getting it mostly right is not good enough. In real-world deployments, the failure mode that matters most is not ignorance—it is false confidence. Agents act when they should hesitate, fabricate when they should refuse, and choose when they should ask.
The CAR-bench paper arrives at exactly the right moment. It does not ask whether agents can complete tasks. It asks whether they can do so consistently, under uncertainty, and with a calibrated understanding of their own limits. The answer, empirically, is: not yet.
Background — From capability benchmarks to deployment reality
Most existing agent benchmarks are built for best‑case scenarios. Tasks are fully specified, tools exist, parameters are known, and success is binary. If the model reaches the correct end state once, it “passes.”
But production systems—especially safety‑critical ones like in‑car assistants—do not operate this way. Users give ambiguous commands. Required tools are missing. Policies constrain what can be done and when. And agents must decide how to proceed, not just what to do.
CAR‑bench addresses this gap directly. Inspired by earlier work such as τ‑bench, it shifts the evaluation target from raw task execution to deployment‑grade reliability.
Analysis — What CAR‑bench actually tests
CAR‑bench simulates a realistic automotive assistant environment with:
- 58 interconnected tools across navigation, charging, vehicle control, productivity, and weather
- 19 domain‑specific policies, including safety and confirmation rules
- Multi‑turn dialogue with an LLM‑simulated user
- A mutable environment with persistent state
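To make this concrete, here is a toy sketch of what a mutable, policy‑constrained tool environment can look like. It is an illustration in the spirit of the benchmark, not CAR‑bench's actual code; the state fields, tool names, and confirmation rule are all invented for the example.

```python
from dataclasses import dataclass, field

# Illustrative only: a toy mutable environment in the spirit of CAR-bench.
# Tool calls change persistent state, and a policy can require explicit
# user confirmation before certain actions run.

@dataclass
class VehicleState:
    cabin_temp_c: float = 21.0
    charging: bool = False
    confirmations: set = field(default_factory=set)  # actions the user has confirmed

def set_cabin_temperature(state: VehicleState, target_c: float) -> str:
    """A 'vehicle control' tool: mutates persistent state."""
    state.cabin_temp_c = target_c
    return f"Cabin temperature set to {target_c:.1f} °C"

def start_charging(state: VehicleState) -> str:
    """A 'charging' tool gated by a confirmation policy."""
    if "start_charging" not in state.confirmations:
        return "POLICY: explicit user confirmation required before charging"
    state.charging = True
    return "Charging started"

state = VehicleState()
print(set_cabin_temperature(state, 19.5))  # succeeds, state persists across turns
print(start_charging(state))               # blocked until the user confirms
```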
Crucially, it introduces two task types that most benchmarks ignore:
1. Hallucination tasks — When the task is impossible
In these tasks, something essential is deliberately removed:
- a required tool
- a tool parameter
- or a tool result
The task cannot be completed. Success is defined not by workaround creativity, but by explicit acknowledgment of inability. Fabrication—implicit or explicit—counts as failure.
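A rough sketch of how that grading criterion could be expressed: the run passes only if the agent explicitly states it cannot complete the request and does not pretend the missing tool exists. The `RunLog` record, the keyword heuristic, and the tool names are assumptions for illustration, not the paper's grader.

```python
from dataclasses import dataclass

# Illustrative grading sketch for an "impossible" (hallucination) task.
# Assumption: the run log exposes which tools the agent called and its
# final reply; the real CAR-bench grader may work differently.

@dataclass
class RunLog:
    tool_calls: list[str]  # names of tools the agent invoked
    final_reply: str       # agent's closing message to the user

REFUSAL_MARKERS = ("cannot", "can't", "unable", "not possible", "don't have access")

def grade_impossible_task(log: RunLog, removed_tool: str) -> bool:
    """Pass only if the agent explicitly acknowledges it cannot complete the task."""
    claims_inability = any(m in log.final_reply.lower() for m in REFUSAL_MARKERS)
    pretends_tool_exists = removed_tool in log.tool_calls
    # Fabrication (implicit or explicit) fails; an honest refusal passes.
    return claims_inability and not pretends_tool_exists

log = RunLog(
    tool_calls=["get_weather"],
    final_reply="I can't start preconditioning: that control isn't available.",
)
print(grade_impossible_task(log, removed_tool="start_preconditioning"))  # True
```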
2. Disambiguation tasks — When acting is premature
Here, the task is solvable, but only after resolving ambiguity. Agents must:
- Detect that ambiguity exists
- Resolve it internally if possible (via tools, context, or preferences)
- Ask the user only if necessary
Premature action, unnecessary clarification, and guessing all count as task failures.
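One way to read this requirement is as a three‑way decision gate that an agent loop has to get right on every turn. The sketch below is an interpretation, not the benchmark's logic; the function and its inputs are placeholders.

```python
from enum import Enum, auto

# Illustrative decision gate for disambiguation tasks. The ordering encodes
# the expectation: act only when the request is unambiguous, prefer internal
# resolution (tools, context, preferences), and ask the user only as a last
# resort. These names are placeholders, not CAR-bench APIs.

class NextStep(Enum):
    ACT = auto()
    RESOLVE_INTERNALLY = auto()
    ASK_USER = auto()

def decide(request_is_ambiguous: bool, resolvable_from_context: bool) -> NextStep:
    if not request_is_ambiguous:
        return NextStep.ACT                 # acting now is safe
    if resolvable_from_context:
        return NextStep.RESOLVE_INTERNALLY  # check tools, context, preferences first
    return NextStep.ASK_USER                # clarify only when necessary

# "Navigate to the office" when two plausible work addresses are saved:
print(decide(request_is_ambiguous=True, resolvable_from_context=False))  # NextStep.ASK_USER
```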
Measuring what actually matters: consistency
CAR‑bench reports both:
- Pass@k — Did the agent succeed at least once?
- Pass^k — Did it succeed every time?
This distinction turns out to be devastating.
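The two metrics are easy to pin down: run each task k times, credit Pass@k if at least one run succeeds, and Pass^k only if all k succeed. A minimal sketch with made‑up trial data shows how far apart they can sit:

```python
# Minimal sketch of the two metrics as defined above: each task is run k times;
# Pass@k asks for at least one success, Pass^k for k out of k.
# The trial data is invented purely to show the computation.

def pass_at_k(trials_per_task: list[list[bool]]) -> float:
    return sum(any(t) for t in trials_per_task) / len(trials_per_task)

def pass_hat_k(trials_per_task: list[list[bool]]) -> float:
    return sum(all(t) for t in trials_per_task) / len(trials_per_task)

trials = [
    [True, True, True],     # solved every time
    [True, False, True],    # capable, but not reliable
    [False, True, False],   # occasionally lucky
    [False, False, False],  # never solved
]
print(f"Pass@3 = {pass_at_k(trials):.2f}")  # 0.75: looks strong
print(f"Pass^3 = {pass_hat_k(trials):.2f}") # 0.25: the consistency cliff
```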
Findings — Capability without reliability
Across state‑of‑the‑art models, three results stand out.
1. The consistency cliff
Even frontier reasoning models show large gaps between potential and reliability. For disambiguation tasks, GPT‑5 drops from roughly 68% Pass@3 to 36% Pass³. The model knows how to solve the task—but fails to do so consistently.
This is not a scaling problem. It is a control problem.
2. Thinking helps—but only partially
Reasoning‑enabled models outperform non‑thinking ones across all task types, especially as task complexity increases. They:
- Reduce logical errors
- Lower explicit hallucination rates
- Improve policy compliance
But they do not fix premature action. In disambiguation tasks, even thinking models act too early in ~90% of failures.
Reasoning helps agents think better. It does not yet make them wait better.
3. Hallucination is still the default escape hatch
When a request cannot be satisfied, models face a choice:
- Admit limitation
- Or fabricate plausibly
Non‑thinking models often hallucinate outright. Thinking models improve—but frequently switch to implicit fabrication, quietly omitting missing checks while presenting confident outcomes.
This behavior is not accidental. It is reinforced by training regimes that reward completion over refusal.
Snapshot: overall performance (Pass³, averaged)
| Task type | Where the best models plateau |
|---|---|
| Base tasks | ~50–55% |
| Hallucination | ~60% |
| Disambiguation | <50% (no model exceeds 50%) |
No model is deployment‑ready by this standard.
Implications — Why this changes how we should build agents
CAR‑bench exposes a structural tension at the heart of today’s LLM agents: completion vs. compliance.
Agents are optimized to satisfy the user quickly. Policies, uncertainty checks, and self‑restraint are secondary objectives—and they lose when the system is under pressure.
For businesses and system designers, three implications follow:
- Do not equate Pass@1 with readiness. One‑off success hides brittle behavior.
- Reasoning tokens are not a silver bullet. Control flow and decision gating matter more.
- Refusal is a first‑class capability. If your agent cannot say “I can’t,” it cannot be trusted.
In safety‑critical or regulated environments, this suggests a hybrid future:
- LLMs for planning and interaction
- Hard system layers for validation, gating, and enforcement
In other words: less autonomy, more architecture.
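As a deliberately simplified illustration of that architecture: a deterministic gating layer sits between the model's proposed tool call and its execution, so policy enforcement never depends on the model remembering to comply. The policy table and tool names below are assumptions, not anything from CAR‑bench.

```python
# Sketch of a hard gating layer outside the model: the LLM proposes tool
# calls, but a deterministic validator decides whether they run. The policy
# contents and tool names are illustrative assumptions.

SAFETY_POLICIES = {
    "start_charging": {"requires_confirmation": True},
    "open_windows":   {"blocked_while_driving": True},
}

def gate_tool_call(tool: str, confirmed: bool, driving: bool) -> str:
    policy = SAFETY_POLICIES.get(tool, {})
    if policy.get("requires_confirmation") and not confirmed:
        return "BLOCKED: ask the user to confirm first"
    if policy.get("blocked_while_driving") and driving:
        return "BLOCKED: not allowed while the vehicle is moving"
    return "ALLOWED: execute tool"

# The model wants to start charging without having asked for confirmation:
print(gate_tool_call("start_charging", confirmed=False, driving=True))
```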
Conclusion — A benchmark that tells an uncomfortable truth
CAR‑bench does something rare: it makes progress look smaller, not larger. And that is precisely why it is valuable.
The benchmark shows that today’s agents are often capable but unreliable, confident but fragile, and helpful until they shouldn’t be. Scaling models alone will not fix this. What is missing is calibrated self‑awareness—knowing when not to act.
Until agents learn that, the real bottleneck is not intelligence. It is restraint.
Cognaptus: Automate the Present, Incubate the Future.