Agents of Disruption: How LLMs Became Adversarial Testers for Autonomous Driving

TL;DR for operators

AGENTS-LLM is not another attempt to make a language model dream up an entire traffic world and then hope the simulator forgives the hallucination. It does something narrower and more operationally useful: it takes an existing real-world driving scenario, accepts a natural-language instruction such as adding a parked vehicle, jaywalker, accident site, or construction zone, and produces an augmented scenario that can be executed in closed-loop autonomous-driving simulation.¹

That distinction matters. Autonomous-driving validation teams do not merely need more synthetic scenes. They need more credible hard scenes: scenarios close enough to recorded traffic distributions to be useful, but modified enough to expose planner weaknesses. The paper’s bet is that LLM agents are useful not because they are poetic traffic storytellers, but because they can translate human scenario intent into structured edits, check their own work, and use tools when geometry becomes too annoying for pure language reasoning.

The evidence is mixed in the useful way. GPT-4o performs strongly even with a simple one-shot modifier. Smaller or cheaper models start weaker, but function calling and QA loops narrow the gap. Function calling helps most with placement accuracy. Visual QA helps more with expert-perceived visual quality than with displacement error. Those are different things, which is why the paper is more interesting than the usual “agents good, baseline bad” theatre.

For business use, the practical opportunity is scenario multiplication: turning a finite set of real driving logs into a larger catalogue of adversarial validation cases. The paper does not prove automated safety certification, regulator-ready assurance, or universal robustness across planner stacks. It does suggest a credible workflow for reducing expert bottlenecks in AV testing, especially where teams already use simulation and need more long-tail cases without manually crafting every miserable little traffic surprise. Progress, apparently, sometimes looks like teaching an LLM to place a badly parked car in exactly the wrong place.

The useful trick is editing reality, not inventing it

Autonomous-driving validation has a data problem with a cruel sense of humour. The events that matter most are the ones that occur least often. A smooth lane-following scene is easy to collect. A pedestrian crossing at the wrong moment, a vehicle blocking sightlines before an intersection, a construction zone forcing a nudge manoeuvre, or an accident site that compresses decision time—those are harder to capture, harder to stage, and not the sort of thing one should enthusiastically manufacture on public roads.

The obvious response is synthetic generation. Generate more scenes, test more planners, sleep more peacefully. Sadly, synthetic driving scenes have a habit of drifting away from the distribution that real planners will actually face. This is particularly awkward for learning-based planners: if the test world is too unlike the deployment world, the evaluation becomes less a safety test and more a simulator personality quiz.

AGENTS-LLM takes the less glamorous route. It starts from real scenarios and modifies them. That is the paper’s central correction to a common misconception: the LLM is not being asked to invent a complete driving universe from scratch. It is being asked to perform controlled augmentation. The original map, lane geometry, traffic context, and simulator compatibility remain anchored in real-world data. The LLM’s job is to insert or modify the relevant traffic agents and objects so that a normal scene becomes a harder one.

This is the difference between painting a fake city and moving a traffic cone in a real street photo. The second task is smaller. It is also exactly the kind of smaller task that tends to survive contact with operational workflows.

The mechanism begins with a scenario modifier

The framework has a simple spine. A Scenario Modifier Agent receives three things: a structured representation of the original scenario, a natural-language instruction, and enough prompt scaffolding to produce a modified scenario vector. Its output is then either sent directly into the interPlan-compatible augmentation interface or passed into a QA loop for checking and revision.

The paper’s scenario representation is deliberately textual. Agents, lanes, lane connectors, and areas are encoded as lists of attributes. Agents include type, centre coordinates, heading, dimensions, velocity, and lane ID. Lanes include direction, relative direction to the ego vehicle, width, speed limit, and sampled coordinates. Areas are represented by boundary points. This is not the prettiest part of the paper. It is, however, the part that makes the agent usable.

A language model cannot reliably manipulate raw simulator internals unless those internals are exposed in a form it can reason over. The text representation acts as an operating surface. It gives the LLM something structured enough to edit and human-readable enough to follow. In business terms, this is the integration layer. Nobody gets promoted for the integration layer, but the demo collapses without it.
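To make the "operating surface" concrete, here is a minimal sketch of what an attribute-list encoding could look like. The attributes are the ones listed above; the class names, field names, units, and line format are assumptions for illustration, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Agent:
    # Attribute set follows the article; names and units are hypothetical.
    agent_type: str
    center: Tuple[float, float]
    heading: float               # radians (assumed)
    size: Tuple[float, float]    # length, width in metres (assumed)
    velocity: float              # m/s (assumed scalar)
    lane_id: str


@dataclass
class Lane:
    lane_id: str
    direction: str
    relative_to_ego: str
    width: float
    speed_limit: float
    points: List[Tuple[float, float]]  # sampled polyline coordinates


def describe_agent(agent: Agent) -> str:
    """Serialise one agent as a single attribute line an LLM can read and edit."""
    cx, cy = agent.center
    length, width = agent.size
    return (
        f"agent type={agent.agent_type} center=({cx:.1f}, {cy:.1f}) "
        f"heading={agent.heading:.2f} size=({length:.1f}, {width:.1f}) "
        f"velocity={agent.velocity:.1f} lane={agent.lane_id}"
    )


def describe_lane(lane: Lane) -> str:
    """Serialise one lane, including its sampled polyline points."""
    pts = "; ".join(f"({x:.1f}, {y:.1f})" for x, y in lane.points)
    return (
        f"lane id={lane.lane_id} direction={lane.direction} "
        f"relative_to_ego={lane.relative_to_ego} width={lane.width:.1f} "
        f"speed_limit={lane.speed_limit:.1f} points={pts}"
    )


parked = Agent("vehicle", (12.5, -3.0), 1.57, (4.5, 2.0), 0.0, "lane_7")
print(describe_agent(parked))
```

The point of the flat, one-line-per-entity format is that an edit is a text edit: adding a parked car means appending one more `agent ...` line, which is something a language model can plausibly do without touching simulator internals.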

The paper studies four prompting or agentic variants:

Variant	Lane representation	QA loop	Likely purpose in the paper
One-time modifier, or OTM	Polyline	None	Main baseline: what happens when the LLM edits once without extra help
Function calling, or FC	Bézier	None	Tool-use test: whether retrieving lane points reduces placement errors
Text QA, or tQA	Polyline	Text-only	Agent-loop test: whether critique and feedback improve outputs
Visual QA, or vQA	Polyline	Visual-text	Multimodal QA test: whether BEV image inspection improves perceived scenario quality

This design makes the paper mechanism-first almost by force. The result does not come from a single magic prompt. It comes from how representation, tool use, and QA are combined.

Tool use attacks the most boring failure mode: coordinates

The paper’s most important engineering insight is almost offensively mundane: LLMs are bad at precise placement unless the environment gives them help.

In the one-time modifier setting, the framework asks the model to place new agents using the text description and polyline lane coordinates. The authors then render the outputs and manually classify common failures. They identify three categories: position errors, heading errors, and logic errors.

The numbers are revealing:

Model, OTM variant	Total errors	Position errors	Heading errors	Logic errors
GPT-4o	5	3	0	2
Gemini-1.5-Flash	15	6	2	7
Llama3.1-70B	16	11	3	2

The dominant problem, especially for Llama3.1-70B, is not grand semantic misunderstanding. It is putting the thing in the wrong place. (Gemini-1.5-Flash is the partial exception: its logic errors, at 7, narrowly outnumber its position errors, at 6.) The parked car is not parked where it should be. The jaywalker is not where the instruction implies. The agent fails to retrieve or infer the correct lane anchor. This is not artificial general intelligence wrestling with metaphysics. This is a geometry intern needing a calculator.
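The shares behind that reading are easy to check. A minimal sketch using only the counts in the table above; the dictionary layout and function name are illustrative, not the paper's tooling.

```python
# Error counts copied from the OTM table above: (position, heading, logic).
OTM_ERRORS = {
    "GPT-4o": (3, 0, 2),
    "Gemini-1.5-Flash": (6, 2, 7),
    "Llama3.1-70B": (11, 3, 2),
}
CATEGORIES = ("position", "heading", "logic")


def summarise(counts):
    """Return total errors, the position-error share, and the largest category."""
    total = sum(counts)
    position_share = counts[0] / total
    largest = CATEGORIES[counts.index(max(counts))]
    return total, position_share, largest


for model, counts in OTM_ERRORS.items():
    total, share, largest = summarise(counts)
    print(f"{model}: {total} errors, {share:.0%} position, largest = {largest}")
```

Position errors account for 60% of GPT-4o's failures, 40% of Gemini-1.5-Flash's, and about 69% of Llama3.1-70B's. Gemini is the one model whose largest category is logic rather than position, by a single error.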

Function calling addresses exactly that. Instead of giving the model only sampled polyline coordinates, the FC variant uses a Bézier lane representation and lets the modifier call a function to retrieve a relevant point along a lane or connector. The LLM decides when to call the function and what arguments to pass. That external tool reduces the need for the model to fake spatial precision from token patterns.
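A rough sketch of what such a lane-point tool could look like follows. The function name, its arguments, the schema dictionary, and the choice of a cubic Bézier with four control points are all assumptions for illustration; the paper's actual function signature and curve degree are not described here.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]

# Hypothetical lane store: lane ID -> four cubic Bezier control points.
LANES: Dict[str, List[Point]] = {
    "lane_7": [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (30.0, 0.0)],
}

# Illustrative tool description handed to the model (not any vendor's exact format).
GET_LANE_POINT_SCHEMA = {
    "name": "get_lane_point",
    "description": "Return x, y and heading at parameter t along a lane.",
    "parameters": {"lane_id": "string", "t": "number in [0, 1]"},
}


def get_lane_point(lane_id: str, t: float) -> Dict[str, float]:
    """Tool body: position and heading at curve parameter t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    p0, p1, p2, p3 = LANES[lane_id]
    u = 1.0 - t
    # Cubic Bezier position: u^3 P0 + 3 u^2 t P1 + 3 u t^2 P2 + t^3 P3.
    x = u**3 * p0[0] + 3 * u**2 * t * p1[0] + 3 * u * t**2 * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u**2 * t * p1[1] + 3 * u * t**2 * p2[1] + t**3 * p3[1]
    # The first derivative is the tangent, which gives the lane heading.
    dx = 3 * u**2 * (p1[0] - p0[0]) + 6 * u * t * (p2[0] - p1[0]) + 3 * t**2 * (p3[0] - p2[0])
    dy = 3 * u**2 * (p1[1] - p0[1]) + 6 * u * t * (p2[1] - p1[1]) + 3 * t**2 * (p3[1] - p2[1])
    return {"x": x, "y": y, "heading": math.atan2(dy, dx)}


def dispatch(tool_call: Dict) -> Dict[str, float]:
    """Execute a model-issued call of the form {'name': ..., 'arguments': {...}}."""
    if tool_call["name"] != "get_lane_point":
        raise KeyError(f"unknown tool: {tool_call['name']}")
    return get_lane_point(**tool_call["arguments"])


# The model asks for the midpoint of lane_7; the tool does the arithmetic.
midpoint = dispatch({"name": "get_lane_point", "arguments": {"lane_id": "lane_7", "t": 0.5}})
print(midpoint)
```

Note that `t` here is the curve parameter, not distance travelled; on a curved lane the two differ. The design point is the division of labour: the model chooses the lane and roughly where along it, and the tool returns coordinates and a heading that are consistent with the lane geometry by construction.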

The advanced-prompting comparison is therefore best read as an ablation-style test, even if the paper does not frame it with that exact label. The purpose is to isolate whether tool use and QA loops narrow the gap between a frontier model and cheaper models. The answer is yes, with a caveat: function calling most clearly improves placement accuracy, especially for the smaller models, while GPT-4o has less room to improve because its one-shot baseline is already strong.

That is a useful operational lesson. If the failure mode is coordinate retrieval, buy or build a geometry tool. Do not ask a language model to become a surveyor through moral encouragement.

QA loops improve judgement, but not always the metric you expected

The QA loops are where the paper becomes more subtle. The text QA agent receives the original scenario, the user instruction, the modified vectors, and a list of common observed mistakes. It summarises the situation, generates verification questions, answers them, and rates the output on compliance with the instruction, realism, and logical consistency. If the average score is below 4 out of 5, it sends step-by-step feedback to the modifier for regeneration.
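The control flow of that loop can be sketched in a few lines. The three rating dimensions and the 4-out-of-5 threshold come from the description above; the function names, the round cap, and the stub modifier and reviewer are assumptions standing in for real LLM calls.

```python
from statistics import mean
from typing import Callable, Dict, Tuple

Scores = Dict[str, float]  # compliance, realism, consistency on a 1-5 scale
Modifier = Callable[[str, str, str], str]                 # (scenario, instruction, feedback) -> candidate
Reviewer = Callable[[str, str, str], Tuple[Scores, str]]  # (scenario, instruction, candidate) -> (scores, feedback)


def run_text_qa(modify: Modifier, review: Reviewer, scenario: str, instruction: str,
                threshold: float = 4.0, max_rounds: int = 3) -> Tuple[str, int, bool]:
    """Regenerate until the mean QA score reaches the threshold or rounds run out.

    max_rounds is an assumed safety cap, not a value taken from the paper.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback = ""
    for round_idx in range(1, max_rounds + 1):
        candidate = modify(scenario, instruction, feedback)
        scores, feedback = review(scenario, instruction, candidate)
        if mean(scores.values()) >= threshold:
            return candidate, round_idx, True
    return candidate, max_rounds, False


# Stubs in place of LLM calls: the first draft is weak, the revised one passes.
def stub_modify(scenario: str, instruction: str, feedback: str) -> str:
    return "draft_v2" if feedback else "draft_v1"


def stub_review(scenario: str, instruction: str, candidate: str) -> Tuple[Scores, str]:
    if candidate == "draft_v1":
        return {"compliance": 4, "realism": 3, "consistency": 3}, "Align heading with the lane."
    return {"compliance": 5, "realism": 4, "consistency": 4}, ""


result = run_text_qa(stub_modify, stub_review, "<scenario text>", "Add a parked vehicle.")
print(result)
```

Returning the accepted flag alongside the candidate matters operationally: a scenario that exhausted its rounds without passing is exactly the kind of case a human reviewer should see.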

The visual QA variant is more elaborate. A QA Engineer generates critical questions. A vision-language model inspects a bird’s-eye-view image of the modified scenario and answers those questions. The QA Engineer then uses those answers to identify mistakes and provide feedback to the modifier.

This looks like the sort of multi-agent machinery that can become fashionable very quickly and useful somewhat later. The paper’s results help separate the two.

For displacement error, visual QA does not beat text QA on average. The paper is careful here: vQA was evaluated only for Gemini-1.5-Flash, because it is token-heavy, multimodal, and cost-inefficient to run with an already strong frontier model such as GPT-4o. It is also not applicable to Llama3.1-70B in their setup because of the vision requirement.

So the vQA result is not “vision solves scenario generation.” It is narrower: visual QA may help produce outputs that human experts perceive as more convincing, even when point-level displacement error does not improve. That distinction is the whole point.

A few metres of displacement error can matter enormously when the intended scenario is a stopped vehicle blocking sightlines before an intersection. In another case, such as a construction zone, exact cone placement may be less important than the overall scene structure. One metric sees centre-point distance. A human judge sees whether the scenario makes sense. Both are useful. Neither is the whole truth. Metrics, as ever, are little bureaucrats with narrow job descriptions.
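A small worked example shows how narrow that job description is. This sketch assumes displacement error is the mean Euclidean distance between generated and reference agent centres, paired by list order; the pairing rule, the coordinates, and the 3.5 m lane width are illustrative assumptions, not details from the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def mean_displacement_error(generated: Sequence[Point], reference: Sequence[Point]) -> float:
    """Mean centre-to-centre distance; agents are paired by index (an assumption)."""
    if not generated or len(generated) != len(reference):
        raise ValueError("need equally sized, non-empty agent lists")
    return sum(math.dist(g, r) for g, r in zip(generated, reference)) / len(generated)


reference = [(20.0, 0.0)]        # expert-placed parked car, lane centre at y = 0
shifted_along = [(23.5, 0.0)]    # same lane, 3.5 m further down the road
shifted_across = [(20.0, 3.5)]   # 3.5 m sideways, assumed to be the neighbouring lane

print(mean_displacement_error(shifted_along, reference))
print(mean_displacement_error(shifted_across, reference))
```

Both placements score 3.5 m. One keeps the car in the intended lane; the other moves it into a different lane and may change the scenario entirely. The metric cannot tell them apart, which is why the paper pairs it with human judgement.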

The evaluation stack measures three different claims

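The paper uses five evaluation components, and they should not be collapsed into one generic “performance” statement. Each answers a different question, and together they back three separate claims: placement accuracy, expert-perceived quality, and downstream planner challenge.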

Evaluation component	Likely purpose	What it supports	What it does not prove
Displacement error against human interPlan modifications	Main quantitative evidence for placement accuracy	Whether generated agents are close to expert-placed reference agents	Whether the whole scene looks realistic or is equally useful for planner testing
Manual error categorisation	Diagnostic analysis	Which failure modes dominate: position, heading, or logic	Precise general error rates outside the 50-scenario catalogue
LCTGen comparison	Comparison with prior language-based scenario work	AGENTS-LLM better fits pedestrian and construction-style augmentations and preserves fine-grained instruction detail	Universal superiority over all generative scenario methods
Expert pairwise Elo ranking	Main human-perception evidence	Whether experts prefer or accept outputs visually compared with interPlan and other variants	Full safety validity or simulator-wide realism
nuPlan closed-loop planner scores	Downstream usefulness test	Whether generated scenarios challenge a strong planner similarly to human-crafted interPlan cases	Whether AV systems become safer, or whether all planner families are equally stressed

This separation matters for business readers. A validation lead does not need one heroic number. They need to know which part of the workflow is becoming more reliable. Placement accuracy helps with precise scenario design. Expert ranking helps with perceived realism and intent satisfaction. Planner scores help with downstream challenge. Conflating those would be convenient. It would also be sloppy, and we are trying not to audition for a vendor webinar.

Human experts could barely separate GPT-4o from manual interPlan scenarios

The human ranking experiment is one of the stronger parts of the paper. Nine autonomous-driving research experts performed 5,760 pairwise comparisons across eight AGENTS-LLM variants and human-generated interPlan scenarios. The paper computes Elo ratings with bootstrapped 95% confidence intervals.
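For readers who have not met Elo outside chess, here is a minimal sketch of rating pairwise votes and bootstrapping an interval. The K-factor, the base rating of 1000, the sequential update order, the no-ties simplification, and the percentile indexing are all assumptions; the paper's exact procedure may differ.

```python
import random
from typing import Dict, List, Tuple

Match = Tuple[str, str, str]  # (left variant, right variant, winner)


def elo_ratings(matches: List[Match], k: float = 4.0, base: float = 1000.0) -> Dict[str, float]:
    """Sequential Elo updates over pairwise votes (ties are not modelled)."""
    ratings: Dict[str, float] = {}
    for a, b, winner in matches:
        ra = ratings.setdefault(a, base)
        rb = ratings.setdefault(b, base)
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = 1.0 if winner == a else 0.0
        delta = k * (score_a - expected_a)
        ratings[a] = ra + delta   # the update is zero-sum:
        ratings[b] = rb - delta   # what a gains, b loses
    return ratings


def bootstrap_interval(matches: List[Match], name: str, rounds: int = 200,
                       seed: int = 0, base: float = 1000.0) -> Tuple[float, float]:
    """Approximate 95% interval: resample votes with replacement and re-rate."""
    rng = random.Random(seed)
    samples = []
    for _ in range(rounds):
        resampled = [rng.choice(matches) for _ in matches]
        samples.append(elo_ratings(resampled, base=base).get(name, base))
    samples.sort()
    return samples[int(0.025 * (rounds - 1))], samples[int(0.975 * (rounds - 1))]


# Toy votes: variant A is preferred in 30 of 40 comparisons against variant B.
votes: List[Match] = [("A", "B", "A")] * 30 + [("A", "B", "B")] * 10
ratings = elo_ratings(votes)
low, high = bootstrap_interval(votes, "A")
print(ratings, (low, high))
```

Two properties are worth keeping in mind when reading the table that follows. Elo is relative, so ratings only mean something against the other entries in the same pool. And overlapping intervals mean the experiment could not reliably separate two entries, not that they are identical.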

The top results are close:

Scenario source or variant	Rank	Elo	95% CI	Votes
interPlan	1	1042	-9 / +11	1960
GPT-4o OTM	1	1039	-9 / +11	1720
Gemini-1.5-Flash vQA	1	1025	-12 / +13	720
Llama3.1-70B tQA	3	1011	-16 / +15	600
Gemini-1.5-Flash tQA	3	1003	-15 / +15	600
Gemini-1.5-Flash FC	4	998	-10 / +9	1360
Llama3.1-70B FC	5	984	-8 / +10	1360
Gemini-1.5-Flash OTM	8	953	-12 / +12	1600
Llama3.1-70B OTM	8	941	-13 / +11	1600

The headline is not that the machine “beats humans.” It does not. The useful headline is that GPT-4o with the simplest one-time modifier is almost indistinguishable from human-created interPlan scenarios in blind expert comparison. Gemini with visual QA also reaches the top confidence-overlapping group. Meanwhile, the one-shot versions of Gemini and Llama sit clearly lower.

This is exactly the kind of result operators should prefer: not magical, but directional. Strong models can do the job with less scaffolding. Smaller models need scaffolding. Visual QA improves what human judges notice. Function calling improves coordinate placement. There is no single lever. There is a toolbox.

The comparison with LCTGen adds another practical note. LCTGen, adapted to the interPlan catalogue, failed to generate traffic agents for jaywalker and construction-site types because of its vehicle-centric design. Where it did generate agents, the paper reports larger per-category errors. In one qualitative example, LCTGen places a vehicle behind an intersection and in the wrong lane despite only moderate displacement error. That example is useful because it shows why a numeric distance can be too forgiving. A wrong lane is not a small aesthetic disagreement. It is the scenario missing its job.

Planner scores show downstream stress, not certification

The final test asks whether these generated scenarios actually challenge an autonomous-driving planner. The authors use the nuPlan simulator and evaluate PDM-Closed, described in the paper as a top-performing nuPlan planner. PDM-Closed generates candidate trajectories using IDM-style rollouts, evaluates them for safety, progress, and comfort, and selects the best one. The authors also adjust the proposal sampling to allow broader lateral deviations, which is important for scenarios where the centreline is blocked.
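Why the lateral-sampling adjustment matters can be shown with a deliberately toy model. This is not PDM-Closed: the collision test, the scoring weights, the vehicle widths, and the offset lists below are all invented for illustration. It only mimics the pattern of scoring candidate proposals and picking the best one.

```python
from typing import List, Optional


def collides(offset: float, obstacle_offset: float = 1.0,
             obstacle_half_width: float = 1.0, ego_half_width: float = 1.0) -> bool:
    """Toy 1-D overlap test between the ego proposal and a parked obstacle."""
    return abs(offset - obstacle_offset) < obstacle_half_width + ego_half_width


def score(offset: float) -> float:
    """Toy score: a collision zeroes the proposal; otherwise penalise lateral deviation."""
    if collides(offset):
        return 0.0
    return 1.0 - 0.1 * abs(offset)


def select(offsets: List[float]) -> Optional[float]:
    """Pick the best-scoring lateral offset, or None if every proposal collides."""
    best = max(offsets, key=score)
    return best if score(best) > 0.0 else None


narrow = [-0.5, 0.0, 0.5]             # proposals hugging the centreline
broad = [-1.5, -0.5, 0.0, 0.5, 1.5]   # wider lateral sampling

print(select(narrow))  # no collision-free proposal exists
print(select(broad))   # the proposal at -1.5 m clears the obstacle
```

With only near-centreline proposals, every candidate hits the obstacle and the planner has nothing safe to choose. Widening the sampled offsets gives it a way around, so a low score on a blocked-lane scenario reflects the planner's decision-making rather than an artificially narrow menu.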

The mean driving scores are blunt:

Scenario set	Mean driving score
nuPlan Val14	90.8%
interPlan	51.9%
AGENTS-LLM, GPT-4o OTM	49.6%
AGENTS-LLM, Gemini-1.5-Flash OTM	53.5%
AGENTS-LLM, Llama3.1-70B OTM	54.0%

The interpretation is straightforward but easy to overstate. Val14 is close to saturated for this planner. interPlan makes the task much harder. The AGENTS-LLM OTM scenarios produce similarly low scores, meaning they stress the planner at roughly the same level as the human-crafted interPlan augmentations.
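"Roughly the same level" can be made precise with the numbers already in the table. A minimal sketch; the dictionary and function name are illustrative.

```python
# Mean driving scores (%) copied from the table above.
SCORES = {
    "nuPlan Val14": 90.8,
    "interPlan": 51.9,
    "GPT-4o OTM": 49.6,
    "Gemini-1.5-Flash OTM": 53.5,
    "Llama3.1-70B OTM": 54.0,
}


def gap_to_interplan(name: str) -> float:
    """Percentage-point difference from the human-crafted interPlan score."""
    return round(SCORES[name] - SCORES["interPlan"], 1)


for name in SCORES:
    if name != "interPlan":
        print(f"{name}: {gap_to_interplan(name):+.1f} points vs interPlan")
```

All three AGENTS-LLM sets land within 2.3 percentage points of interPlan (-2.3, +1.6, and +2.1), while Val14 sits 38.9 points above it. The generated scenarios are in interPlan's difficulty band, not Val14's.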

That supports the paper’s operational claim: the generated scenarios are not merely plausible to look at; they can be useful in closed-loop planner evaluation. But this is not a safety certificate. It is one planner, one simulator stack, one scenario catalogue, and one style of augmentation interface. A lower planner score means the scenario is challenging. It does not automatically mean the generated scenario is realistic, comprehensive, regulator-ready, or representative of all real-world risk.

Still, for validation teams, this is the relevant kind of progress. A hard scenario that runs in the existing simulator is worth more than a gorgeous synthetic traffic scene that cannot be integrated without three weeks of engineering grief and a small act of prayer.

What changes for autonomous-driving validation teams

The business relevance of AGENTS-LLM is not “replace safety engineers with agents.” That would be reckless, tedious, and legally exciting in all the wrong ways.

The more grounded interpretation is workflow leverage. Today, expert-crafted long-tail scenarios are valuable because they encode safety intuition, map context, and planner challenge. They are also expensive because experts have limited time. AGENTS-LLM points toward a semi-automated process:

Start with real recorded driving logs.
Select scenarios that are plausible candidates for stress testing.
Use natural-language instructions to specify the desired augmentation.
Let the LLM modifier produce structured scenario edits.
Use tool calls for geometry-sensitive placement.
Use QA loops for intent checking and visual plausibility.
Run the resulting scenarios in closed-loop simulation.
Route failures and suspicious cases back to human reviewers.

This does not remove the expert. It moves the expert from handcrafting every scenario to designing scenario families, reviewing failures, calibrating instructions, and auditing outputs. That is a better use of scarce judgement.

The ROI pathway is clearest in three places.

First, regression testing. Once a planner changes, teams need to know whether old weaknesses return or new ones appear. A larger catalogue of real-data-grounded adversarial scenarios makes regression testing more informative.

Second, coverage expansion. Real logs underrepresent rare events. Augmentation can multiply scenario variants around intersections, parked vehicles, jaywalkers, construction zones, and accident sites without pretending that every variant was naturally collected.

Third, scenario authoring speed. Natural language lowers the cost of translating safety hypotheses into simulator-ready tests. The cost does not fall to zero. It moves from manual vector editing to instruction design, automated generation, QA, and selective expert review. That is still a better bargain.

Technical contribution	Operational consequence	ROI relevance
Natural-language-guided augmentation of real scenarios	Faster creation of long-tail variants from existing logs	Reduces expert bottleneck in scenario authoring
Text-based scenario representation	LLMs can edit simulator-compatible structures	Lowers integration friction if representation maps to internal tools
Function calling for lane coordinates	Better placement accuracy for smaller models	Reduces dependence on expensive frontier APIs for geometry-heavy edits
Text and visual QA loops	Automated checking before human review	Improves reviewer throughput and catches obvious failures
Closed-loop planner evaluation	Generated scenarios can stress an actual planner	Connects generation to validation, not just content creation

The quiet strategic value is that this makes safety testing more iterative. A planner fails. Engineers inspect the failure. They generate neighbouring variants. The planner improves. The catalogue expands. Over time, the scenario library becomes less a static benchmark and more a living adversarial memory. Very corporate. Very useful.

Where the result should not be stretched

The paper is strong enough that it does not need inflated claims. Its boundaries are clear.

The evaluation recreates 50 interPlan human-augmented scenarios. That is meaningful, but not a universe. The scenario categories come from interPlan’s catalogue: construction zones, accident sites, jaywalkers, nudging around parked vehicles, and overtaking obstacles with oncoming traffic. These are important, but they are not the full ecology of autonomous-driving risk.

The expert ranking uses nine judges and 5,760 comparisons, which is substantial for human evaluation but still bounded by the expertise, visual presentation, and scenario set used. Elo ranking gives a useful relative preference signal, not an absolute certificate of realism.

Visual QA is promising but narrow. It was evaluated only for Gemini-1.5-Flash because of cost and modality constraints. The paper explicitly notes that running vQA with a frontier model would be cost-inefficient given GPT-4o’s already strong OTM performance, and Llama3.1-70B was ruled out by the visual-input requirement. So vQA should be treated as an exploratory extension within the framework, not as a fully general result across model classes.

The planner benchmark uses PDM-Closed in nuPlan. That is a serious downstream test, but it does not establish general planner robustness or cross-simulator validity. A scenario that breaks one planner may not break another. A scenario that looks plausible in bird’s-eye view may still miss subtleties of behaviour, perception, or real-world interaction.

Finally, the framework depends on a structured scenario representation and compatibility with the augmentation interface. Organisations with messy simulation pipelines—an entirely hypothetical category, of course—would need integration work before seeing value.

The paper’s deeper lesson is about agents as test infrastructure

AGENTS-LLM is easy to misread as another “LLMs for autonomous driving” paper. That framing is too broad. The more precise lesson is that agentic LLMs are becoming test infrastructure.

They are not driving the car. They are not replacing the planner. They are not solving perception, control, or safety validation. They are acting as adversarial scenario editors: translating human intent into simulator changes, calling tools when spatial reasoning gets brittle, and using QA loops to catch obvious mistakes before a human has to look.

That is a respectable job. More importantly, it is a job with a clear operational interface. Inputs are existing scenarios and natural-language augmentation requests. Outputs are modified scenario vectors and closed-loop simulation cases. Evaluation can be done with displacement error, expert preference, and planner challenge. This is the kind of AI workflow that can be audited, improved, and priced. A rare pleasure.

For autonomous-driving businesses, the implication is not that LLMs have suddenly solved long-tail safety. They have not. The implication is that the scenario-generation bottleneck may be partially automatable without abandoning real-world grounding. That is a smaller claim. It is also the one worth paying attention to.

The best version of this technology would not be an autonomous “chaos monkey” spraying random hazards into a simulator. It would be a disciplined scenario workbench: humans define risk hypotheses, agents generate variants, tools enforce geometry, QA loops catch nonsense, planners are tested, and experts review the residue.

Safety-critical engineering rarely advances through one grand breakthrough. More often, it advances when a painful manual process becomes repeatable enough to scale. AGENTS-LLM is interesting because it points to exactly that: not artificial driving intelligence, but artificial testing labour. Less cinematic, more useful. Naturally, the useful version gets fewer headlines.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yu Yao, Salil Bhatnagar, Markus Mazzola, Vasileios Belagiannis, Igor Gilitschenski, Luigi Palmieri, Simon Razniewski, and Marcel Hallgarten, “AGENTS-LLM: Augmentative GENeration of Challenging Traffic Scenarios with an Agentic LLM Framework,” arXiv:2507.13729, submitted July 18, 2025, https://arxiv.org/abs/2507.13729.
