The promise of fully autonomous vehicles hinges on their ability to handle not just the average drive—but the unexpected. Yet, creating rare, safety-critical scenarios for testing autonomous driving (AD) systems has long been a bottleneck. Manual scene creation doesn’t scale. Generative models often drift away from real-world distributions. And collecting edge cases on the road? Too dangerous, too slow.
Enter AGENTS-LLM, a deceptively simple yet powerful framework that uses Large Language Models (LLMs) not to solve traffic scenes, but to break them. The twist? These aren’t just static prompts or synthetic scripts. AGENTS-LLM organizes LLMs into a multi-agent, modular system that modifies real traffic scenarios with surgical precision—making them trickier, nastier, and far more useful for evaluating planning systems.
## The Problem with Long-Tail Driving Events
Autonomous driving systems are well-tuned for common driving patterns but struggle with “long-tail” edge cases: a jaywalker stepping out unexpectedly, a car stalled just past an intersection, a construction site blocking optimal paths. These events are:
- Rare in real datasets, requiring millions of hours of driving to capture just a few.
- Too sensitive to stage manually at scale, due to ethical or safety concerns.
- Difficult to synthesize without creating unrealistic or distribution-shifting noise.
AGENTS-LLM reframes the problem by augmenting real scenarios—not generating them from scratch. This retains naturalistic realism while injecting controlled doses of chaos.
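To make that distinction concrete, here is a minimal Python sketch of augmentation as a data operation. The paper does not publish its scenario schema, so the dataclasses and the `add_stalled_car` helper below are hypothetical; the point is simply that a recorded scene is copied and one synthetic hazard is injected, leaving everything else untouched.

```python
import math
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class Agent:
    agent_id: str
    x: float        # metres, map frame
    y: float
    heading: float  # radians
    speed: float    # m/s

@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    agents: Tuple[Agent, ...] = ()

def add_stalled_car(scene: Scenario, ego: Agent, gap_m: float) -> Scenario:
    """Clone a recorded scene and inject one stalled vehicle ahead of ego.

    Every recorded agent is kept as-is, so the augmented scene stays
    anchored to naturalistic traffic; only the new hazard is synthetic.
    """
    stalled = Agent(
        agent_id="stalled_car",
        x=ego.x + gap_m * math.cos(ego.heading),
        y=ego.y + gap_m * math.sin(ego.heading),
        heading=ego.heading,
        speed=0.0,
    )
    return replace(scene, agents=scene.agents + (stalled,))
```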
## How It Works: A Multi-Agent LLM Framework
The system relies on an agentic architecture with specialized roles:
- Scenario Modifier Agent (SMA): Takes original scenario vectors + user instructions (“Add a stalled car 17.5m ahead of ego”) and rewrites the scene.
- Quality Assurance Agents (QA): Two variants, one purely text-based (Text QA) and one that also inspects rendered bird’s-eye-view (BEV) imagery (Visual QA). Both verify whether the modification matches the user’s intent.
These agents can even call functions—e.g., to find lane coordinates or calculate distances. The SMA doesn’t need to “understand” the map deeply; it can ask the right questions.
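As a rough illustration of that pattern, here is a minimal tool-dispatch sketch in Python. The tool names (`get_lane_center`, `distance_between`) and the JSON-style call format are assumptions for illustration, not the paper’s published interface.

```python
import math

def get_lane_center(lane_id: str) -> list[tuple[float, float]]:
    """Return the centerline polyline of a lane (stubbed for this sketch)."""
    return {"lane_42": [(0.0, 0.0), (10.0, 0.0), (20.0, 0.5)]}.get(lane_id, [])

def distance_between(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Euclidean distance in metres between two map points."""
    return math.hypot(b[0] - a[0], b[1] - a[1])

# A registry the LLM runtime exposes as callable tools; on each turn the
# model emits a tool name plus arguments, and the harness dispatches it.
TOOLS = {"get_lane_center": get_lane_center, "distance_between": distance_between}

def dispatch(tool_call: dict):
    fn = TOOLS[tool_call["name"]]
    return fn(**tool_call["arguments"])

# e.g. the SMA asks where lane_42 runs before placing a stalled car on it:
print(dispatch({"name": "get_lane_center", "arguments": {"lane_id": "lane_42"}}))
```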
| Role | Input | Output |
|---|---|---|
| SMA | Scenario + natural-language instruction | Modified scenario vectors |
| QA Agent | Modified scenario + spec | Feedback or approval |
| Visual QA | Rendered BEV scene image | Visual consistency check |
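A minimal sketch of the propose-critique loop these roles imply, assuming a generic chat-completion callable standing in for whichever model backs each agent; the prompts and the `APPROVED` convention are illustrative, not the paper’s exact protocol.

```python
from typing import Callable

# (system prompt, user payload) -> model reply
LLM = Callable[[str, str], str]

def modify_scenario(scene: str, instruction: str, sma: LLM, qa: LLM,
                    max_rounds: int = 3) -> str:
    """SMA drafts an edit; the QA agent approves it or returns feedback."""
    draft = sma("You edit traffic scenarios.",
                f"{scene}\nInstruction: {instruction}")
    for _ in range(max_rounds):
        verdict = qa("You verify scenario edits against the instruction.",
                     f"Instruction: {instruction}\nEdited scene: {draft}")
        if verdict.strip().upper().startswith("APPROVED"):
            return draft
        # Feed the QA critique back to the SMA for another revision.
        draft = sma("You edit traffic scenarios.",
                    f"{scene}\nInstruction: {instruction}\nReviewer feedback: {verdict}")
    return draft  # best effort after max_rounds of revision

# Toy stand-ins so the sketch runs end to end without a real model:
toy_sma: LLM = lambda _sys, user: user.splitlines()[-1] + " [edited]"
toy_qa: LLM = lambda _sys, _user: "APPROVED"
print(modify_scenario("scene_0042", "Add a stalled car ahead of ego",
                      toy_sma, toy_qa))
```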
This modular structure lets cheaper models (the open-weight Llama3.1-70B, or Gemini-1.5) produce results on par with GPT-4o, provided they are paired with good prompting and tool use.
## The Results: Less Cost, More Chaos
Three results stand out:
- Competitive Realism: In expert blind trials scored with Elo ratings, GPT-4o’s outputs were indistinguishable from human-created scenarios. With visual QA in the loop, even Gemini-1.5 nearly closed the gap.
- Challenging for State-of-the-Art Planners: In closed-loop simulation on nuPlan, scenarios generated by AGENTS-LLM dragged the PDM-Closed planner’s driving scores down to the level of handcrafted interPlan scenes (~50%).
- Low-Cost Performance Scaling: Instead of relying on GPT-4o’s brute force, function-calling and QA loops let cheaper models perform well. Function-calling reduced placement errors by mitigating common failure modes such as wrong agent positioning or heading (see the sketch after the table below).
| Scenario set | Mean Driving Score (%) |
|---|---|
| nuPlan Val14 (normal cases) | 90.8 |
| interPlan (manual edge cases) | 51.9 |
| AGENTS-LLM (GPT-4o) | 49.6 |
| AGENTS-LLM (Gemini-1.5) | 53.5 |
| AGENTS-LLM (Llama3.1) | 54.0 |
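To see why tool use helps with positioning and heading errors, consider one plausible mitigation: rather than trusting raw coordinates proposed by the model, snap the pose onto the lane centerline returned by a map query. The `snap_to_lane` helper below is a hypothetical illustration of this idea, not the paper’s API.

```python
import math

def snap_to_lane(pose: tuple[float, float, float],
                 centerline: list[tuple[float, float]]) -> tuple[float, float, float]:
    """Project a proposed (x, y, heading) onto the nearest centerline segment."""
    x, y, _ = pose
    best = None
    for (ax, ay), (bx, by) in zip(centerline, centerline[1:]):
        # Parametric projection onto segment a->b, clamped to [0, 1].
        dx, dy = bx - ax, by - ay
        t = max(0.0, min(1.0, ((x - ax) * dx + (y - ay) * dy) / (dx * dx + dy * dy)))
        px, py = ax + t * dx, ay + t * dy
        d = math.hypot(x - px, y - py)
        if best is None or d < best[0]:
            # Heading is realigned with the segment direction as well.
            best = (d, (px, py, math.atan2(dy, dx)))
    return best[1]

# A model-proposed pose slightly off the lane gets pulled back onto it:
lane = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
print(snap_to_lane((7.0, 1.3, 0.4), lane))  # -> (7.0, 0.0, 0.0)
```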
## Why This Matters for Safety-Critical AI
AGENTS-LLM hints at a broader shift in how we evaluate intelligent systems:
> We don’t just need AI that performs. We need AI that can probe performance.
By positioning LLMs as adversarial agents rather than task-solvers, this framework turns them into low-cost QA testers. Just as fuzzing transformed software testing, AGENTS-LLM could usher in the age of autonomous adversaries—stress-testing not just cars, but financial systems, customer service bots, and AI decision-makers.
Cognaptus clients working in logistics, robotics, and compliance-heavy domains should take note: adversarial agents are not just a safety tool—they are a strategic asset.
Cognaptus: Automate the Present, Incubate the Future