Label Now, Drive Later: Why Autonomous Driving Needs Fewer Clicks, Not Smarter Annotators

Clicks are a cost centre.

In a 3D annotation tool, deleting an unnecessary bounding box may take one or two seconds. Creating a missed vehicle annotation from scratch takes about 23 seconds. Correcting a poorly positioned box falls somewhere in between.

These actions may all count as model errors. They do not cost the same amount of human time.

That distinction is the central contribution of Semi-Automated Data Annotation in Multisensor Datasets for Autonomous Vehicle Testing, a report describing the annotation infrastructure developed for Poland’s DARTS autonomous-driving dataset.¹ The paper introduces the Correction Acceleration Ratio, or CAR, to evaluate a detector according to the labour required to repair its predictions rather than its accuracy score alone.

The result is an uncomfortable lesson for anyone selecting AI systems from a leaderboard: a detector with higher Average Precision can still be the more expensive colleague.

The leaderboard changes when errors are priced in seconds

Average Precision, or AP, is useful when the question is whether a detector correctly identifies objects under a benchmark’s evaluation rules.

It is less useful when the question is how long an annotator will spend repairing the detector’s output.

The DARTS researchers compare several 3D object detectors using both AP and CAR. The differences between the two rankings are where the paper becomes operationally interesting.

| Dataset | Models compared | What AP suggests | What CAR suggests |
| --- | --- | --- | --- |
| nuScenes | DSVT and TransFusion-L | Both score 0.87 | DSVT reaches 0.87 CAR, while TransFusion-L reaches 0.83 |
| nuScenes | CenterPoint and PV-RCNN++ | CenterPoint leads, 0.85 versus 0.80 | PV-RCNN++ leads, 0.84 versus 0.76 |
| KITTI | PV-RCNN++ and CenterPoint | PV-RCNN++ leads, 0.84 versus 0.80 | Both reach 0.85 CAR |
| KITTI | PointPillars and DSVT | PointPillars leads, 0.78 versus 0.76 | Both reach 0.83 CAR |

The nuScenes comparison between CenterPoint and PV-RCNN++ is especially revealing. CenterPoint has the stronger AP score, yet PV-RCNN++ produces outputs that the CAR model estimates will require less correction time.
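The divergence is easy to reproduce from the nuScenes figures quoted above. The snippet below is a minimal sketch that only sorts those four published numbers; the dictionary layout and the function name `rank_models` are illustrative choices, not anything taken from the paper.

```python
# AP and CAR values for nuScenes, as quoted in the comparison above.
nuscenes_scores = {
    "DSVT": {"ap": 0.87, "car": 0.87},
    "TransFusion-L": {"ap": 0.87, "car": 0.83},
    "CenterPoint": {"ap": 0.85, "car": 0.76},
    "PV-RCNN++": {"ap": 0.80, "car": 0.84},
}


def rank_models(scores, metric):
    """Return model names ordered from best to worst on one metric.

    Ties keep their original dictionary order because sorted() is stable.
    """
    return sorted(scores, key=lambda name: scores[name][metric], reverse=True)


ranking_by_ap = rank_models(nuscenes_scores, "ap")
ranking_by_car = rank_models(nuscenes_scores, "car")

print("By AP: ", ranking_by_ap)
print("By CAR:", ranking_by_car)
```

PV-RCNN++ sits last on AP and second on CAR, while CenterPoint moves in the opposite direction.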

The issue is not that AP is defective. AP answers a different question.

A model-development team wants to know whether detections are correct. An annotation-operations team wants to know how many minutes remain before the labels are usable. Those questions overlap, but they are not interchangeable.

The existing habit is to select the most accurate model and assume annotation productivity will follow. The evidence here shows why that assumption can fail: accuracy metrics generally count errors according to benchmark rules, while annotation cost depends on the type of repair each error demands.

A missed vehicle is not equivalent to an extra box

To build CAR, the researchers first needed to estimate the human cost of different errors.

They created modified annotations from the nuScenes dataset, loaded them into the Supervisely annotation tool, and asked trained annotators to correct deliberately introduced mistakes. Measurements covered three object classes—cars, pedestrians, and cyclists—and two distance ranges.

The average correction times differed substantially:

A false negative, where the detector misses an object entirely, took about 23 seconds to correct.
A false positive or classification error generally took about one to two seconds.
Translation, rotation, and scale errors required roughly five to sixteen seconds, depending on the correction.
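An organisation repeating this kind of timing study locally would end up with a log of timed corrections to average by error type. The sketch below shows that aggregation step only. The log entries are invented for illustration and are not measurements from the paper, and the error-type labels are assumed names.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical stopwatch log: (error_type, seconds spent on the correction).
# These values are illustrative only, not data from the DARTS study.
timing_log = [
    ("false_negative", 21.0),
    ("false_negative", 25.0),
    ("false_positive", 1.0),
    ("false_positive", 2.0),
    ("translation", 5.0),
    ("translation", 7.0),
]


def average_correction_times(log):
    """Group timed corrections by error type and average each group."""
    grouped = defaultdict(list)
    for error_type, seconds in log:
        grouped[error_type].append(seconds)
    return {error_type: mean(times) for error_type, times in grouped.items()}


avg_times = average_correction_times(timing_log)
print(avg_times)
```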

This cost structure explains why a detector can perform well statistically while remaining inconvenient operationally.

A false positive is usually cheap: find the unnecessary box and delete it. A false negative is expensive because the annotator must discover the missing object, create a new 3D box, position it, rotate it, scale it, and classify it. The annotation interface does not care that both cases occupy one row in an error report. Payroll does.

The paper also defines combined positional errors, including translation with rotation, rotation with scale, translation with scale, and errors involving all three. These combinations matter because correction work is not neatly additive. Adjusting a box’s location and orientation together may take less time than performing two isolated actions consecutively.

That is a small methodological detail with a large practical consequence. Once labour becomes the unit of evaluation, the metric must represent how people actually repair outputs rather than how conveniently researchers can count mistakes.
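One way to respect that non-additivity is to give every combination its own label, so it can later be priced with its own measured time rather than a sum of parts. The following is a sketch under stated assumptions: the `BoxDeviation` fields, the threshold values, and the label format are all hypothetical, and the paper calibrates its own thresholds from manual annotations.

```python
from dataclasses import dataclass


@dataclass
class BoxDeviation:
    translation_m: float  # centre offset from the reference box, in metres
    rotation_deg: float   # absolute yaw difference, in degrees
    scale_ratio: float    # relative size difference, e.g. 0.10 means 10%


# Hypothetical tolerances, chosen only for this example.
THRESHOLDS = {"translation": 0.2, "rotation": 5.0, "scale": 0.1}


def classify_positional_error(dev: BoxDeviation) -> str:
    """Label a matched box by which tolerances it exceeds.

    Combined labels such as "translation+rotation" stay distinct so each
    can be assigned its own correction time.
    """
    parts = []
    if dev.translation_m > THRESHOLDS["translation"]:
        parts.append("translation")
    if dev.rotation_deg > THRESHOLDS["rotation"]:
        parts.append("rotation")
    if dev.scale_ratio > THRESHOLDS["scale"]:
        parts.append("scale")
    return "+".join(parts) if parts else "within_tolerance"


print(classify_positional_error(BoxDeviation(0.5, 10.0, 0.0)))
```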

CAR converts model errors into estimated annotation labour

In simplified notation, the total estimated correction time is:

$$ T_{\text{correction}} = \sum_{e \in E} t_e n_e $$

where $E$ is the set of error types, $t_e$ is the average time required to correct error type $e$, and $n_e$ is the number of observed errors of that type.

The fully manual baseline is:

$$ T_{\text{manual}} = t_{\text{create}}N $$

where $t_{\text{create}}$ is the average time required to create one annotation from scratch, and $N$ is the number of ground-truth objects.

CAR is then:

$$ \text{CAR} = 1 - \frac{T_{\text{correction}}}{T_{\text{manual}}} $$

The interpretation is deliberately practical:

$\text{CAR} = 1$ means no human correction time is required.
$\text{CAR} = 0$ means the pre-annotations provide no estimated time saving.
$\text{CAR} < 0$ means correcting the automated output is estimated to take longer than manual annotation from scratch.
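The simplified formulas above translate directly into a few lines of code. This is a minimal sketch, not the paper's implementation. The error counts are invented, the per-error times sit within the ranges the article quotes, and treating the 23-second creation time as $t_{\text{create}}$ is an assumption made for the example.

```python
def correction_acceleration_ratio(error_counts, correction_times_s, n_objects, t_create_s):
    """Compute CAR = 1 - T_correction / T_manual from the simplified notation.

    error_counts:       error type -> number of observed errors (n_e)
    correction_times_s: error type -> average seconds per correction (t_e)
    n_objects:          number of ground-truth objects (N)
    t_create_s:         average seconds to create one annotation from scratch
    """
    t_manual = t_create_s * n_objects
    if t_manual <= 0:
        raise ValueError("manual baseline must be positive")
    t_correction = sum(correction_times_s[e] * n for e, n in error_counts.items())
    return 1.0 - t_correction / t_manual


# Illustrative inputs only; counts are hypothetical.
times_s = {"false_negative": 23.0, "false_positive": 2.0, "translation": 5.0}
counts = {"false_negative": 5, "false_positive": 10, "translation": 20}

car = correction_acceleration_ratio(counts, times_s, n_objects=100, t_create_s=23.0)
print(f"Estimated CAR: {car:.2f}")

# A detector that misses everything and adds clutter can push CAR below zero.
bad_counts = {"false_negative": 100, "false_positive": 200}
bad_car = correction_acceleration_ratio(bad_counts, times_s, n_objects=100, t_create_s=23.0)
```

In the first case the estimated correction time is 235 seconds against a 2,300-second manual baseline, giving a CAR of about 0.90. In the second, repairing the output is estimated to cost more than starting from scratch.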

This makes CAR more than another perception metric. It is a compact economic model of a human-in-the-loop workflow.

The metric does not merely ask whether the machine was wrong. It asks what kind of wrongness the machine produced, how frequently it occurred, and how costly it was for a person to repair.

The manual timing study is the foundation, not a side experiment

The detector benchmark receives the most attention because it produces the headline CAR values. However, the manual timing study is what gives those values meaning.

Without measured correction times, CAR would reduce to an arbitrary weighting scheme. The paper instead grounds its weights in observed annotation actions.

The researchers also calibrate thresholds for translation, rotation, and scale errors using manually annotated nuScenes sequences in two annotation tools, Supervisely and Scalabel. Thresholds are selected so that 90% of manual annotations fall within them.

This calibration step is best understood as a way to distinguish meaningful model errors from normal human variation. Two annotators may place a box slightly differently without either annotation being operationally defective. A metric that charges the model for every minor deviation would exaggerate repair work.
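The 90% rule can be sketched as a percentile over observed disagreements between manual annotations. The nearest-rank method used here is an assumption, since the article does not say how the paper computes the cut-off, and the offsets are hypothetical values.

```python
import math


def coverage_threshold(deviations, coverage=0.90):
    """Return the smallest observed deviation that covers the given share
    of samples (nearest-rank percentile)."""
    if not deviations:
        raise ValueError("at least one deviation is required")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(deviations)
    # round() guards against floating-point noise before taking the ceiling.
    rank = math.ceil(round(coverage * len(ordered), 9))
    return ordered[rank - 1]


# Hypothetical centre offsets (metres) between pairs of manual annotations.
manual_offsets_m = [0.02, 0.03, 0.05, 0.05, 0.06, 0.08, 0.10, 0.12, 0.15, 0.40]

translation_threshold = coverage_threshold(manual_offsets_m)
print(translation_threshold)
```

With these ten samples the threshold lands at 0.15 m: nine of the ten manual offsets fall at or below it, and only the 0.40 m outlier would be charged as a translation error.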

The use of two tools gives some reassurance that the thresholds are not purely an artefact of one interface. It does not establish that the same thresholds or correction times will apply everywhere. Tool design, keyboard shortcuts, annotator experience, object density, and internal quality standards can all change the cost of repair.

CAR is therefore most valuable as a method that an organisation can calibrate locally, rather than as a universal table of correction prices.

The highest CAR is 0.93, but the ranking is the more transferable result

Across the reported CAR benchmark, DSVT performs strongly:

| Dataset | DSVT AP | DSVT CAR |
| --- | --- | --- |
| Zenseact | 0.94 | 0.93 |
| nuScenes | 0.87 | 0.87 |
| KITTI | 0.76 | 0.83 |

The highest observed CAR is 0.93 on Zenseact. Under the paper’s timing assumptions and manual-annotation baseline, the authors interpret this as an estimated reduction of up to 93% in annotation effort.
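Rearranging the CAR definition makes the 93% figure concrete. The 1,000-hour manual baseline below is a hypothetical number chosen for easy arithmetic, not a figure from the paper:

$$ T_{\text{correction}} = (1 - \text{CAR})\,T_{\text{manual}} = (1 - 0.93) \times 1000\ \text{h} = 70\ \text{h} $$

Under those assumptions, roughly 70 hours of estimated correction work would remain in place of 1,000 hours of manual annotation.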

Zenseact is particularly relevant because its LiDAR configuration and environmental characteristics resemble the intended DARTS operating domain. The DARTS annotation generator therefore starts with a DSVT model pretrained on nuScenes and fine-tuned using Zenseact data.

It is tempting to treat 93% as the paper’s single conclusion. That would be convenient, memorable, and slightly careless.

The more transferable result is that model rankings change when errors are weighted by repair cost. The exact CAR value will depend on the dataset, object class, distance range, annotation tool, annotators, and baseline assumptions. The ranking divergence demonstrates the broader principle without requiring the 0.93 figure to generalise intact.

In other words, the important question is not whether every annotation operation can remove 93% of its labour. It is whether the organisation is currently selecting models using a metric that ignores the labour it wants to reduce.

The pipeline makes correction-time economics operational

CAR would be little more than an interesting spreadsheet if the surrounding system could not preserve predictions, corrections, provenance, and approval states.

The DARTS pipeline provides that operational layer.

Its workflow can be summarised as:

Raw multisensor recordings
        ↓
Integrity, synchronisation, and format checks
        ↓
Standardised DARTS dataset representation
        ↓
Automated detection, tracking, and anonymisation
        ↓
Human correction in Segments.ai
        ↓
Manager review and approval
        ↓
Versioned annotations stored for release and retraining

Apache Airflow orchestrates the workflow through directed acyclic graphs. Individual services run in Docker containers, allowing preprocessing, annotation generation, anonymisation, imports, exports, and approvals to be executed as modular tasks.
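The ordering guarantee a directed acyclic graph provides can be shown without Airflow itself. The sketch below uses Python's standard-library `graphlib` as a stand-in, so it is not Airflow code. The task names are hypothetical labels for the stages described above, and running annotation generation and anonymisation as parallel branches is an assumption made for illustration.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
pipeline = {
    "validate_recordings": set(),
    "standardise_dataset": {"validate_recordings"},
    "generate_annotations": {"standardise_dataset"},
    "anonymise": {"standardise_dataset"},
    "export_to_annotation_tool": {"generate_annotations", "anonymise"},
    "manager_approval": {"export_to_annotation_tool"},
    "store_versioned_annotations": {"manager_approval"},
}

# static_order() yields every task after all of its dependencies.
execution_order = list(TopologicalSorter(pipeline).static_order())

for step, task in enumerate(execution_order, start=1):
    print(step, task)
```

The two middle tasks may appear in either order, because nothing in the graph forces one before the other. That freedom is what lets an orchestrator schedule independent stages as separate containerised jobs.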

A PostgreSQL database serves as the system’s source of truth. It records scene metadata, annotation versions, origins, and validation states. Files, processed datasets, model checkpoints, and workflow artefacts are stored in a structured file system.

This architecture matters because human-in-the-loop improvement requires more than placing a model before an annotator. The system must know:

which model created an annotation;
what the human changed;
who approved it;
which corrected examples should enter later training;
whether a newer model actually reduced expensive corrections.

Without provenance and versioning, correction work disappears into a finished dataset. With them, each correction becomes evidence about where the automation is still costly.
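A provenance record that answers those questions might look like the following. This is a hypothetical schema, not the DARTS database design, and it uses an in-memory SQLite database as a stand-in for PostgreSQL so the example runs on its own. All table, column, and user names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL instance
conn.execute(
    """
    CREATE TABLE annotation_version (
        id           INTEGER PRIMARY KEY,
        scene_id     TEXT NOT NULL,
        version      INTEGER NOT NULL,
        origin       TEXT NOT NULL,   -- model name, or 'human'
        corrected_by TEXT,
        approved_by  TEXT,
        state        TEXT NOT NULL
            CHECK (state IN ('generated', 'corrected', 'approved'))
    )
    """
)

rows = [
    ("scene_001", 1, "dsvt_finetuned", None, None, "generated"),
    ("scene_001", 2, "human", "annotator_a", None, "corrected"),
    ("scene_001", 3, "human", "annotator_a", "manager_b", "approved"),
]
conn.executemany(
    "INSERT INTO annotation_version "
    "(scene_id, version, origin, corrected_by, approved_by, state) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)


def latest_approved(connection, scene_id):
    """Return (version, origin, approved_by) for the newest approved version."""
    return connection.execute(
        "SELECT version, origin, approved_by FROM annotation_version "
        "WHERE scene_id = ? AND state = 'approved' "
        "ORDER BY version DESC LIMIT 1",
        (scene_id,),
    ).fetchone()


print(latest_approved(conn, "scene_001"))
```

Keeping the generated version alongside the corrected and approved ones is what makes later comparisons between model output and human repair possible.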

The modules solve different operational failure points

The paper describes several modules, but their value is clearer when grouped by the operational problems they address.

Reduce the initial correction burden

The Annotation Generator produces preliminary 3D object annotations from LiDAR point clouds. It uses a modified OpenPCDet framework and a DSVT detector adapted to the intended domain.

The Multiple Object Tracking module then improves continuity across frames. It uses Kalman-filter outputs to refine object positions, applies size-consistency corrections, adds stationary-object handling, and smooths trajectories using polynomial fitting.

These components aim to give annotators a more coherent first draft. A frame-by-frame detector may identify the same vehicle repeatedly without preserving a stable identity or trajectory. That is technically a detection output, but it is a poor starting point for temporal annotation.
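As one illustration of what a size-consistency correction can mean, the sketch below replaces every per-frame box size in a track with the per-dimension median. The median rule is an assumption, since the article does not describe the paper's exact procedure, and the track data are invented.

```python
from statistics import median


def enforce_size_consistency(track):
    """Give every box in a track the per-dimension median size.

    A rigid object should not change dimensions between frames, so frames
    with partial views are pulled towards the typical estimate.
    """
    median_size = tuple(median(box["size"][i] for box in track) for i in range(3))
    return [{**box, "size": median_size} for box in track]


# Hypothetical track: (length, width, height) in metres for one vehicle.
track = [
    {"frame": 0, "size": (4.5, 1.8, 1.5)},
    {"frame": 1, "size": (4.7, 1.9, 1.5)},
    {"frame": 2, "size": (3.9, 1.8, 1.6)},  # e.g. a partially occluded view
]

smoothed = enforce_size_consistency(track)
print(smoothed[2]["size"])
```

After the correction an annotator reviewing the track sees one stable box size instead of three slightly different ones, which removes a class of scale fixes before they reach a person.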

Prevent invalid data from entering the workflow

The preprocessing module checks whether required files exist, whether sensor timestamps align, and whether data remain consistent through checksum and log analysis.

When validation fails, the system generates a PDF report and emails administrators.
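Checks of this kind are simple to express. The sketch below covers file presence, checksums, and timestamp alignment, and returns a list of problems that a reporting step could then format. The function names, file names, and the 50-millisecond skew tolerance are assumptions for illustration, not details from the paper.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large sensor files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_scene(scene_dir, expected_checksums, lidar_ts, camera_ts, max_skew_s=0.05):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, expected in expected_checksums.items():
        path = Path(scene_dir) / name
        if not path.is_file():
            problems.append(f"missing file: {name}")
        elif sha256_of(path) != expected:
            problems.append(f"checksum mismatch: {name}")
    if len(lidar_ts) != len(camera_ts):
        problems.append("frame count mismatch between sensors")
    for i, (a, b) in enumerate(zip(lidar_ts, camera_ts)):
        if abs(a - b) > max_skew_s:
            problems.append(f"timestamp skew at frame {i}: {abs(a - b):.3f}s")
    return problems


# Demonstration with a temporary scene: one file present, one missing,
# and a camera frame that drifts 0.1 s from the LiDAR frame.
with tempfile.TemporaryDirectory() as scene_dir:
    (Path(scene_dir) / "lidar.bin").write_bytes(b"points")
    checksums = {
        "lidar.bin": hashlib.sha256(b"points").hexdigest(),
        "camera.jpg": "0" * 64,
    }
    report = validate_scene(scene_dir, checksums, [0.00, 0.10, 0.20], [0.01, 0.11, 0.30])

print(report)
```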

This is not glamorous AI work. It is also the sort of work that determines whether the expensive AI work can be trusted. A highly accurate detector applied to misaligned sensor data merely produces sophisticated confusion at scale.

Preserve privacy before external review

The anonymisation module detects faces using RetinaFace and licence plates using a fine-tuned YOLO model. Detected regions can be recorded as annotations or blurred before data leave the internal system.

For a national driving dataset, privacy handling is part of the production workflow, not a final publishing chore. Once images have been copied into external review systems, delayed anonymisation becomes a governance problem rather than a preprocessing task.

Turn corrections into reusable evidence

The Segments Toolkit moves datasets and generated annotations into the external annotation interface, then retrieves reviewed outputs while preserving information about their origin and the person who corrected them.

DARTS Utils compares annotation versions, visualises point clouds and trajectories, and calculates AP and CAR.

This closes the loop. The system does not merely produce labels; it records how much intervention those labels required.
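Comparing a generated annotation version with its corrected successor is what turns a finished label set into error counts. The sketch below is not the DARTS Utils code. It assumes object IDs stay stable between versions and represents each box as a hypothetical `(x, y, z, yaw)` tuple.

```python
def diff_annotations(generated, corrected, tolerance=1e-6):
    """Compare two annotation versions keyed by object ID.

    added:    boxes the annotator created (the model missed them)
    deleted:  boxes the annotator removed (the model invented them)
    modified: boxes kept but adjusted beyond the tolerance
    """
    added = sorted(corrected.keys() - generated.keys())
    deleted = sorted(generated.keys() - corrected.keys())
    modified = sorted(
        object_id
        for object_id in generated.keys() & corrected.keys()
        if any(
            abs(a - b) > tolerance
            for a, b in zip(generated[object_id], corrected[object_id])
        )
    )
    return {"added": added, "deleted": deleted, "modified": modified}


# Hypothetical boxes: object ID -> (x, y, z, yaw).
generated = {
    "car_1": (10.0, 2.0, 0.0, 0.00),
    "car_2": (25.0, -3.0, 0.0, 1.57),
    "car_3": (40.0, 0.5, 0.0, 0.00),
}
corrected = {
    "car_1": (10.0, 2.0, 0.0, 0.00),
    "car_2": (25.4, -3.0, 0.0, 1.57),
    "car_4": (33.0, 6.0, 0.0, 3.14),
}

changes = diff_annotations(generated, corrected)
print(changes)
```

Counted over many scenes, the three lists become the false-negative, false-positive, and positional-error tallies that a correction-cost metric needs.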

What the evidence directly supports—and what businesses can infer

The paper combines a metric contribution, a comparative detector benchmark, and an implemented annotation architecture. These support different conclusions.

| Paper component | What it directly supports | Cognaptus business interpretation | Boundary |
| --- | --- | --- | --- |
| Manual correction-time study | Error types impose materially different repair costs | Measure the local cost of errors before optimising automation | Reported times may change with tools, teams, and task definitions |
| CAR detector benchmark | AP and estimated correction-time savings can rank models differently | Select models using workflow economics alongside accuracy | CAR results cover the vehicle class and selected distance ranges |
| Zenseact CAR of 0.93 | DSVT pre-annotations can provide large estimated savings on a similar public dataset | Similar-domain benchmarks can support staffing and ROI scenarios | The value is not a direct full-production measurement on DARTS |
| Airflow, Docker, PostgreSQL, and Segments.ai integration | A traceable human-in-the-loop pipeline is technically feasible | Treat annotation as a managed production process | The paper does not isolate the ROI contribution of each module |
| Correction provenance and retraining loop | Corrected labels can be retained for later model improvement | Use expensive corrections to prioritise retraining | The paper does not report a longitudinal retraining experiment |

The business inference is straightforward: once correction work is measured by type, the organisation can optimise the entire workflow around expensive mistakes.

A team might accept more cheap false positives if that trade-off reduces costly false negatives. It might prioritise retraining on scenarios that repeatedly force annotators to create boxes from scratch. It might redesign the annotation interface if a theoretically minor error consumes disproportionate time.

These actions extend beyond choosing the strongest detector. They turn annotation logs into an operational dataset about human effort.

That broader use is a Cognaptus inference, not an experiment reported in the paper. The paper supplies the measurement logic and supporting infrastructure. Each organisation must still decide how to convert those measurements into model thresholds, training priorities, vendor requirements, and staffing plans.
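The trade-off between cheap false positives and costly false negatives can be made explicit when choosing a detector's confidence threshold. The sketch below is illustrative only: the per-error costs loosely follow the article's figures, while the candidate thresholds and error counts are invented.

```python
# Per-error correction cost in seconds (approximate, for illustration).
COST_S = {"false_positive": 2.0, "false_negative": 23.0}

# Hypothetical validation results at three confidence thresholds.
candidates = [
    {"threshold": 0.2, "false_positive": 80, "false_negative": 4},
    {"threshold": 0.4, "false_positive": 60, "false_negative": 9},
    {"threshold": 0.6, "false_positive": 20, "false_negative": 18},
]


def correction_cost(candidate):
    """Estimated seconds of repair work implied by a candidate's errors."""
    return sum(COST_S[kind] * candidate[kind] for kind in COST_S)


def error_count(candidate):
    """Raw number of errors, ignoring how long each takes to fix."""
    return candidate["false_positive"] + candidate["false_negative"]


cheapest = min(candidates, key=correction_cost)
fewest_errors = min(candidates, key=error_count)

print("Lowest repair cost at threshold", cheapest["threshold"])
print("Fewest raw errors at threshold", fewest_errors["threshold"])
```

With these made-up counts, the 0.2 threshold produces the most errors (84) yet the lowest estimated repair time (252 seconds), while the 0.6 threshold produces the fewest errors (38) and the highest repair time (454 seconds).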

CAR can travel beyond autonomous driving, but its numbers cannot

The CAR framework applies naturally to other human-reviewed automation systems.

Consider document extraction. Correcting one mistyped invoice field may take seconds, while discovering that an entire invoice was omitted may require reopening the source, creating a record, and checking every field.

In content moderation, reversing an incorrect category may be cheap. Investigating an unflagged high-risk item may be considerably more expensive.

In customer-support automation, editing an awkward sentence is different from identifying that the system failed to create a required escalation.

The transferable method is:

Define meaningful error types.
Measure how long people take to repair each type.
Count the errors produced by each candidate system.
Compare total repair cost with the manual baseline.
Recalibrate as tools, staff, and operating conditions change.

The reported correction times and CAR scores do not transfer with the method. A 23-second false negative in a LiDAR tool says little about the cost of a missed clause in a legal document. What transfers is the refusal to treat all errors as economically equal.

Where the 93% figure stops

The paper provides a strong operational framing, but several boundaries matter when using its results.

First, CAR is an estimated labour-saving metric. It combines detector errors with average measured correction times. It is not a direct stopwatch measurement of the entire DARTS production process, which also includes inference, data transfer, infrastructure operation, managerial review, and final quality control.

Second, the CAR benchmark reports results for the car or vehicle class. The timing study includes cars, pedestrians, and cyclists, but the detector-comparison table does not establish equivalent CAR performance across all object classes.

Third, CAR is not reported for Waymo, although AP results are. The KITTI CAR evaluation also uses a distance range of 20 to 50 metres, while Zenseact and nuScenes use 0 to 50 metres. CAR values across datasets should therefore not be read as a perfectly controlled cross-dataset ranking.

Fourth, AP protocols differ across benchmarks, with Zenseact evaluated using the nuScenes protocol. Within-dataset comparisons remain informative, but comparisons of absolute AP values across datasets require restraint.

Finally, the authors expect the strong Zenseact result to translate to DARTS because the domains are similar. That is a reasonable projection, especially given the model’s domain-adaptation path. It remains a projection until the pipeline’s correction savings are measured on DARTS data at production scale.

The paper also does not evaluate downstream perception-model quality after using the resulting annotations. Its focus is annotation infrastructure and efficiency, not whether the completed DARTS dataset ultimately produces safer or more accurate autonomous-driving systems.

Fewer clicks is a model-selection principle

Autonomous-driving datasets are often described in kilometres, scenes, sensors, and object counts. Their production cost is eventually paid in human actions.

The DARTS paper makes those actions visible.

Its most useful contribution is not the claim that automation removes annotators from the process. The pipeline keeps humans at the point where judgment and approval remain necessary. Instead, it asks whether the machine has reduced the amount and difficulty of the work reaching them.

That question changes how models should be selected. It changes which errors deserve retraining attention. It changes what an annotation platform should record. It also changes how managers should calculate the value of an AI assistant.

A detector’s job inside a human-in-the-loop system is not simply to maximise its benchmark score. Its job is to leave behind the cheapest possible set of corrections without compromising the required label quality.

That is why autonomous driving needs fewer clicks, not smarter annotators.

Cognaptus: Automate the Present, Incubate the Future.

¹ Andrii Gamalii, Daniel Górniak, Robert Nowak, Bartłomiej Olber, Krystian Radlak, and Jakub Winter, “Semi-Automated Data Annotation in Multisensor Datasets for Autonomous Vehicle Testing,” arXiv:2512.24896, https://arxiv.org/abs/2512.24896.
