Audit the Bots: When AI Judges the Work of Other AI

A bot finishes a task on a computer. It says the file was downloaded, the form was submitted, the setting was changed, or the report was edited.

Now comes the awkward part.

Do we believe it?

For traditional automation, the answer was usually procedural. Check a database field. Inspect a log. Verify an API response. Confirm that a rule fired. Robotic process automation was brittle, yes, but at least its brittleness often left a trail. The machine followed a script; the script touched known systems; the success condition could usually be hard-coded by someone patient enough to suffer through enterprise software.

Computer-use agents change that bargain. They operate graphical interfaces directly. They read natural-language instructions, look at screenshots, click buttons, type text, scroll windows, and navigate ordinary software more like a person than a script. This makes them flexible. It also makes them annoyingly difficult to evaluate.

The paper “CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents” studies a tempting solution: use another AI system as the auditor.¹ A vision-language model sees the instruction and the final GUI state, then judges whether the task was completed. Neat. Scalable. Almost suspiciously convenient.

The paper’s real contribution is not simply that VLM auditors can work. The more useful finding is that audit reliability has at least three separate dimensions: accuracy, confidence calibration, and inter-model agreement. A model may often be right, yet still be too confident when it is wrong. Two strong models may both look impressive in aggregate, yet disagree when the interface becomes ambiguous. That is not a minor measurement detail. In production, those differences determine whether the automation system proceeds, asks for human confirmation, or quietly creates a mess for someone else to discover later.

The paper does not test an agent. It tests the judge.

The study evaluates five vision-language models as auditors of computer-use agents. The auditor is not asked to perform the task. It is asked to inspect the outcome.

For each task instance, the auditor receives two pieces of evidence:

Evidence given to auditor	Meaning
Natural-language instruction	What the computer-use agent was supposed to do
Final GUI screenshot	The observable state after the agent acted

From that, the auditor outputs:

Auditor output	Operational meaning
Binary judgment	Done or not done
Confidence score	How strongly the model believes the task was completed

The study compares the auditor’s judgment against benchmark-provided task success labels. It does this across three computer-use benchmarks: macOSWorld, Windows Agent Arena, and OSWorld. These cover macOS, Windows, and Linux-style desktop environments, with different applications and task patterns.

The five auditors are:

Category	Model
Proprietary	GPT-4o
Proprietary	Claude 3.5 Sonnet
Open-source	InternVL-2-8B
Open-source	LLaVA-v1.5-7B
Open-source	Qwen2-VL-7B

This is why the paper is best read evidence-first. There is no new grand architecture to admire from across the room. The value is in how the evidence decomposes the word “reliable.” In AI deployment, that word is frequently used as if it were a single knob. It is not. It is a cabinet full of knobs, and some are mislabeled.

Accuracy says whether the auditor is often right

The first result is the obvious one: how often does the auditor match the benchmark label?

Auditor	macOSWorld	Windows Agent Arena	OSWorld
GPT-4o	0.91	0.71	0.77
Claude 3.5 Sonnet	0.89	0.75	0.79
InternVL-2-8B	0.85	0.69	0.72
LLaVA-v1.5-7B	0.82	0.66	0.68
Qwen2-VL-7B	0.87	0.68	0.73

The proprietary models perform best overall. GPT-4o leads on macOSWorld, while Claude 3.5 Sonnet leads on Windows Agent Arena and OSWorld. Among the open-source auditors, InternVL-2-8B and Qwen2-VL-7B are competitive relative to LLaVA-v1.5-7B, though both still trail the proprietary models on every benchmark in the table.

But the more important pattern is not the model ranking. It is the environmental drop.

All auditors perform best on macOSWorld. Their accuracy falls on Windows Agent Arena and OSWorld. GPT-4o moves from 0.91 on macOSWorld to 0.71 on Windows Agent Arena. Claude moves from 0.89 to 0.75. Qwen2-VL-7B moves from 0.87 to 0.68. That is not a rounding error. That is the interface reminding us that “desktop automation” is not one environment. It is a collection of visual conventions, application states, hidden actions, and task-specific cues.

For business readers, the lesson is direct: do not certify an AI auditor using a single aggregate score. A customer-support desktop, a finance operations workstation, a procurement portal, and a compliance workflow may all be “computer use,” but they are not the same audit problem. An auditor that looks strong in one environment may become merely tolerable in another.

The paper’s accuracy results should therefore be read as the main evidence that audit reliability is environment-specific. They support the claim that VLM auditing is feasible. They do not support the lazy conclusion that one auditor can be dropped into any GUI workflow and trusted equally everywhere. That would be convenient, and convenience has a long history of being wrong.
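
To make the "no single aggregate score" point concrete, here is a minimal sketch that reports accuracy per environment next to the pooled number. The record layout, the function name `accuracy_by_environment`, and the toy data are assumptions for illustration; they are not the paper's evaluation code or results.

```python
from collections import defaultdict

def accuracy_by_environment(records):
    """Return {environment: accuracy} for (environment, predicted_done, label_done) records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for environment, predicted_done, label_done in records:
        totals[environment] += 1
        if predicted_done == label_done:
            hits[environment] += 1
    return {env: hits[env] / totals[env] for env in totals}

# Made-up toy records, not data from the paper.
records = [
    ("macos", True, True),
    ("macos", False, False),
    ("macos", True, True),
    ("macos", True, False),
    ("windows", True, False),
    ("windows", False, False),
    ("windows", True, True),
    ("windows", False, True),
]

per_env = accuracy_by_environment(records)
overall = sum(pred == label for _, pred, label in records) / len(records)

print(per_env)  # {'macos': 0.75, 'windows': 0.5}
print(overall)  # 0.625
```

The pooled 0.625 hides the fact that the toy auditor is no better than a coin flip on one of the two environments, which is exactly the kind of gap a single headline score conceals.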

Calibration says whether the auditor knows when to hesitate

Accuracy answers a simple question: was the auditor right?

Calibration asks a subtler question: when the auditor says it is confident, should we believe the confidence?

The paper measures calibration using the Brier score, where lower is better. In plain language, the Brier score is the average squared gap between the stated confidence and what actually happened, so it penalizes confidence that does not match reality. A model that says “90% sure” and is correct about nine times in ten at that confidence level is better calibrated than a model that tosses around 90% confidence like office stationery.
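
For readers who want the arithmetic, the standard binary Brier score is $\frac{1}{N}\sum_{i}(p_i - y_i)^2$, where $p_i$ is the stated probability that task $i$ was completed and $y_i$ is 1 if it actually was and 0 otherwise. The sketch below assumes the paper uses this standard form; the function name and the toy numbers are illustrative only.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between P(done) and the 0/1 outcome. Lower is better."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    return sum((p - y) ** 2 for p, y in zip(confidences, outcomes)) / len(confidences)

# Toy data: four audited tasks; the last one actually failed.
outcomes = [1, 1, 1, 0]
overconfident = [0.9, 0.9, 0.9, 0.9]  # "90% sure" even on the failure
hesitant = [0.9, 0.9, 0.9, 0.6]       # still says "done", but with less conviction

score_overconfident = brier_score(overconfident, outcomes)
score_hesitant = brier_score(hesitant, outcomes)

print(round(score_overconfident, 4))  # 0.21
print(round(score_hesitant, 4))       # 0.0975
```

Both toy auditors make the same binary call on every task if "done" means a probability of at least 0.5, so their accuracy is identical. Only the Brier score separates the one that hedged on the failure from the one that did not.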

Auditor	macOSWorld	Windows Agent Arena	OSWorld
GPT-4o	0.058 ± 0.003	0.091 ± 0.006	0.074 ± 0.004
Claude 3.5 Sonnet	0.063 ± 0.004	0.099 ± 0.007	0.081 ± 0.005
InternVL-2-8B	0.097 ± 0.007	0.142 ± 0.010	0.118 ± 0.008
LLaVA-v1.5-7B	0.112 ± 0.008	0.159 ± 0.012	0.134 ± 0.009
Qwen2-VL-7B	0.105 ± 0.008	0.167 ± 0.011	0.141 ± 0.010

Here the distinction becomes operationally important. GPT-4o and Claude 3.5 Sonnet have substantially lower Brier scores than the open-source models across all three benchmarks. Their confidence estimates are not merely decoration; they are more aligned with observed correctness.

The open-source models are weaker not only because they are less accurate, but because their confidence is less dependable. That matters because in a deployed automation system, confidence is rarely just reported. It is used.

A production system might use auditor confidence to decide:

Confidence behavior	Downstream action
High confidence, models agree	Accept completion and move on
Medium confidence	Ask the user for confirmation
Low confidence	Re-run the task, collect more evidence, or escalate
High confidence, later proven wrong	The most dangerous case, because the system fails with confidence

This is where the common misconception breaks down. A “good judge” is not just a model with high accuracy. A judge must also know when the evidence is thin. An overconfident auditor is operationally worse than a modest one, because it suppresses the very escalation signals that protect the workflow.
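
One way to spot an overconfident judge is to bucket its stated probabilities and compare each bucket with what actually happened. This is a common reliability check and is offered here as a supplementary sketch: the paper reports Brier scores, not this table, and the function name, bin count, and data are assumptions. Confidences are assumed to lie between 0 and 1.

```python
def reliability_table(confidences, outcomes, n_bins=5):
    """Bucket P(done) values and compare mean confidence with observed completion rate.

    Returns rows of (bin_low, bin_high, count, mean_confidence, completion_rate).
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 lands in the top bin
        bins[index].append((p, y))
    rows = []
    for index, members in enumerate(bins):
        if not members:
            continue  # skip empty buckets rather than report a meaningless rate
        mean_conf = sum(p for p, _ in members) / len(members)
        done_rate = sum(y for _, y in members) / len(members)
        rows.append((index / n_bins, (index + 1) / n_bins, len(members), mean_conf, done_rate))
    return rows

# Toy data: the auditor is very sure about four tasks, but only two succeeded.
confidences = [0.95, 0.90, 0.92, 0.88, 0.15, 0.10]
outcomes = [1, 0, 1, 0, 0, 0]

rows = reliability_table(confidences, outcomes)
for low, high, count, mean_conf, done_rate in rows:
    print(f"{low:.1f}-{high:.1f}  n={count}  stated={mean_conf:.2f}  observed={done_rate:.2f}")
```

In the top bucket the toy auditor states roughly 0.91 on average while only half of those tasks were completed. That gap is the "fails with confidence" case in numeric form.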

In a business process, uncertainty is not a philosophical inconvenience. It is a routing variable.

If the auditor is uncertain, route the case to a human. If auditors disagree, collect more evidence. If the environment is known to degrade audit accuracy, raise the confidence required before the system accepts a result automatically. This is not glamorous. It is just how dependable systems are built: less magic, more plumbing.
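
The plumbing can be sketched in a few lines. Everything named here is hypothetical: the `Action` values, the threshold numbers, and the environment keys are placeholders that a team would set from its own validation data, not values taken from the paper.

```python
from enum import Enum

class Action(Enum):
    ACCEPT = "accept"
    ASK_USER = "ask_user"
    COLLECT_EVIDENCE = "collect_evidence"
    RERUN_OR_ESCALATE = "rerun_or_escalate"

# Hypothetical thresholds: stricter where auditors are known to be less reliable.
ACCEPT_THRESHOLD = {"macos": 0.90, "windows": 0.97, "linux": 0.95}
DEFAULT_ACCEPT_THRESHOLD = 0.97  # unknown environments get the strictest bar
REVIEW_THRESHOLD = 0.60

def route(p_done_by_auditor, environment):
    """Pick a downstream action from each auditor's stated probability of completion."""
    if not p_done_by_auditor:
        raise ValueError("need at least one auditor score")
    verdicts = {p >= 0.5 for p in p_done_by_auditor}
    if len(verdicts) > 1:
        return Action.COLLECT_EVIDENCE  # auditors disagree on done / not done
    weakest = min(p_done_by_auditor)
    accept_at = ACCEPT_THRESHOLD.get(environment, DEFAULT_ACCEPT_THRESHOLD)
    if weakest >= accept_at:
        return Action.ACCEPT
    if weakest >= REVIEW_THRESHOLD:
        return Action.ASK_USER
    return Action.RERUN_OR_ESCALATE

print(route([0.98, 0.95], "macos"))    # Action.ACCEPT
print(route([0.98, 0.95], "windows"))  # Action.ASK_USER
print(route([0.90, 0.30], "macos"))    # Action.COLLECT_EVIDENCE
print(route([0.20, 0.10], "linux"))    # Action.RERUN_OR_ESCALATE
```

The same pair of scores is accepted in one environment and sent to the user in another, which is the point: the threshold belongs to the environment, not to the model.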

Agreement says whether “done” is visually obvious

The third evidence layer is inter-model agreement. The paper uses Cohen’s $\kappa$ to measure how consistently pairs of auditors make the same binary judgment after accounting for chance agreement. Higher values indicate stronger agreement.
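
To make the chance correction concrete: Cohen’s $\kappa$ is $(p_o - p_e)/(1 - p_e)$, where $p_o$ is the observed fraction of matching judgments and $p_e$ is the agreement expected if the two auditors judged independently at their own base rates. The sketch below is a plain implementation of that textbook formula for binary verdicts, with made-up data; it is not the paper's code.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of binary (True/False) judgments."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    a_yes = sum(a) / n
    b_yes = sum(b) / n
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    if expected == 1.0:
        return float("nan")  # both auditors gave one constant answer; kappa is undefined
    return (observed - expected) / (1 - expected)

# Toy verdicts from two auditors on ten tasks (True means "done").
auditor_a = [True, True, True, True, True, True, False, False, False, False]
auditor_b = [True, True, True, True, True, False, False, False, False, True]

raw_agreement = sum(x == y for x, y in zip(auditor_a, auditor_b)) / len(auditor_a)
kappa = cohens_kappa(auditor_a, auditor_b)

print(raw_agreement)    # 0.8
print(round(kappa, 3))  # 0.583
```

The two toy auditors match on 80% of tasks, yet $\kappa$ is only about 0.58, because two judges who each say "done" 60% of the time would already agree on about half the tasks by chance.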

Selected results:

Auditor pair	macOSWorld	Windows Agent Arena	OSWorld
GPT-4o / Claude 3.5 Sonnet	0.76	0.66	0.71
GPT-4o / InternVL-2-8B	0.64	0.57	0.61
GPT-4o / LLaVA-v1.5-7B	0.61	0.54	0.59
GPT-4o / Qwen2-VL-7B	0.66	0.58	0.63
Claude 3.5 Sonnet / Qwen2-VL-7B	0.69	0.61	0.60
InternVL-2-8B / Qwen2-VL-7B	0.68	0.60	0.65
LLaVA-v1.5-7B / Qwen2-VL-7B	0.64	0.67	0.61

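On macOSWorld and OSWorld, the strongest agreement appears between GPT-4o and Claude 3.5 Sonnet, but even there, agreement declines on Windows Agent Arena (from 0.76 to 0.66). Agreement between proprietary and open-source auditors is lower, and agreement among the open-source auditors is moderate. For nearly every listed pair, the more heterogeneous environments create more disagreement; the one exception in the table is LLaVA-v1.5-7B / Qwen2-VL-7B, whose agreement is highest on Windows Agent Arena (0.67).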

This is not simply model rivalry. It points to a deeper observability problem.

Some tasks cannot be verified from the final screenshot alone. A file may appear saved while a background sync failed. A setting may look changed but not applied. A form may be filled but not submitted. A download may be visible in the browser but incomplete. The final screen is evidence, not omniscience.

Different auditors may resolve that missing evidence differently. One may infer success from visible UI cues. Another may demand stronger confirmation. A third may overread the instruction. When this happens, disagreement is not just noise. It is a diagnostic signal that the task outcome is under-observed.

That signal is especially valuable in enterprise automation. The correct response to auditor disagreement is not necessarily “pick the strongest model.” Sometimes the correct response is “this task needs a better success artifact.”
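
One way to act on that signal is to track how often auditors split on each kind of task and flag the kinds that split too often. The sketch below assumes audits are already grouped by a task-type label; the label names, the 0.25 cutoff, and the data are hypothetical.

```python
from collections import defaultdict

def disagreement_by_task_type(audits):
    """audits: iterable of (task_type, [verdict per auditor]).

    Returns {task_type: fraction of instances where the auditors did not all agree}.
    """
    split = defaultdict(int)
    total = defaultdict(int)
    for task_type, verdicts in audits:
        total[task_type] += 1
        if len(set(verdicts)) > 1:
            split[task_type] += 1
    return {task_type: split[task_type] / total[task_type] for task_type in total}

def under_observed(audits, max_rate=0.25):
    """Task types whose disagreement rate suggests the final screenshot is not enough."""
    rates = disagreement_by_task_type(audits)
    return sorted(task_type for task_type, rate in rates.items() if rate > max_rate)

# Made-up audit log: three auditors per task instance.
audits = [
    ("open_settings_page", [True, True, True]),
    ("open_settings_page", [True, True, True]),
    ("open_settings_page", [False, False, False]),
    ("open_settings_page", [True, True, True]),
    ("save_and_sync_file", [True, False, True]),
    ("save_and_sync_file", [True, True, False]),
    ("save_and_sync_file", [True, True, True]),
    ("save_and_sync_file", [False, True, False]),
]

rates = disagreement_by_task_type(audits)
flagged = under_observed(audits)

print(rates)    # {'open_settings_page': 0.0, 'save_and_sync_file': 0.75}
print(flagged)  # ['save_and_sync_file']
```

A flagged task type is a candidate for a better success artifact, not a verdict on which auditor was right.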

A better audit design might require:

Weak evidence pattern	Better evidence to collect
Final screenshot only	Intermediate screenshots or action trace
Visual completion cue	File checksum, database value, or exported artifact
Hidden background operation	System log or process status
Ambiguous instruction	Structured task intent and expected output
Multi-step workflow	Step-level verification checkpoints
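
As one example from the table, a "download finished" screenshot can be backed by a checksum comparison. This sketch uses Python's standard `hashlib`; the helper names and the assumption that an expected SHA-256 digest is available from the source system are illustrative.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def download_verified(path, expected_sha256):
    """True only if the file exists and its SHA-256 matches the expected digest."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    return sha256_of(file_path) == expected_sha256.lower()

# Demonstration with a temporary file standing in for the agent's download.
content = b"id,total\n1,42\n"
expected = hashlib.sha256(content).hexdigest()

with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "report.csv"
    target.write_bytes(content)
    complete_ok = download_verified(target, expected)

    target.write_bytes(content[:9])  # simulate a truncated download
    truncated_ok = download_verified(target, expected)

    missing_ok = download_verified(Path(folder) / "absent.csv", expected)

print(complete_ok, truncated_ok, missing_ok)  # True False False
```

A truncated file can look identical in a browser's download bar, which is why this kind of check complements the visual judgment rather than duplicating it.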

This is the paper’s most business-relevant insight. VLM auditors are not just cheaper substitutes for manual review. Used properly, they can reveal where the workflow itself lacks verifiable evidence.

The evidence points to an audit architecture, not a single auditor

The paper does not propose a full enterprise audit stack. That is not its job. But its results imply one.

A practical CUA audit pipeline should not ask only: “Which VLM is most accurate?” That is procurement thinking wearing a lab coat. The better question is: How should auditor signals be combined with uncertainty, disagreement, and environment-specific risk?

One possible framework:

Layer	What it checks	Business purpose
Primary auditor	Predicts done / not done from instruction and final state	Fast scalable evaluation
Calibration gate	Interprets confidence reliability by environment and model	Prevents overconfident automation
Agreement gate	Compares multiple auditors or model families	Detects ambiguity and under-observation
Evidence collector	Adds logs, intermediate states, artifacts, or structured checks	Reduces visual guesswork
Human escalation	Reviews cases with low confidence, disagreement, or high impact	Controls operational risk

This architecture treats AI auditing as a decision system, not a model leaderboard. That distinction matters.

In low-risk workflows, a single calibrated auditor may be enough. For example, checking whether a browser page visibly reached a public confirmation screen may not require a committee of digital philosophers. In higher-risk workflows, such as payments, legal filings, HR changes, or regulated reporting, final-screen judgment should be only one piece of evidence.
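
That split between low-risk and higher-risk workflows can be written down as policy rather than left to judgment in the moment. The tiers, field names, and numbers below are hypothetical placeholders; the paper does not prescribe them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AuditPolicy:
    min_auditors: int
    min_confidence: float
    needs_non_visual_evidence: bool  # e.g. a log line, checksum, or database value
    needs_human_signoff: bool

# Hypothetical tiers; real values would come from environment-specific validation.
POLICIES = {
    "low": AuditPolicy(1, 0.90, False, False),
    "medium": AuditPolicy(2, 0.95, True, False),
    "high": AuditPolicy(2, 0.98, True, True),
}

def may_auto_accept(risk_tier, p_done_by_auditor, has_non_visual_evidence):
    """Decide whether a task may be accepted without a human, given its risk tier."""
    policy = POLICIES[risk_tier]
    if policy.needs_human_signoff:
        return False  # auditors can inform the reviewer but never replace them
    if len(p_done_by_auditor) < policy.min_auditors:
        return False
    if policy.needs_non_visual_evidence and not has_non_visual_evidence:
        return False
    return min(p_done_by_auditor) >= policy.min_confidence

print(may_auto_accept("low", [0.93], False))           # True
print(may_auto_accept("medium", [0.97, 0.96], False))  # False
print(may_auto_accept("medium", [0.97, 0.96], True))   # True
print(may_auto_accept("high", [0.99, 0.99], True))     # False
```

In this sketch the final screenshot alone is enough only in the lowest tier; everything above it needs a second opinion, a non-visual artifact, or a person.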

The business value, then, is not “replace human QA entirely.” That is the sort of sentence that makes implementation teams tired before the meeting starts. The more realistic value is:

reduce routine manual inspection;
identify ambiguous cases automatically;
create escalation queues based on confidence and disagreement;
improve benchmark and workflow design by exposing unverifiable tasks;
make CUA deployment measurable instead of vibes-based.

A VLM auditor can lower the cost of diagnosis. It cannot magically make hidden state visible.

What the paper directly shows, and what we should infer carefully

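Each of the study’s results can be read in three layers: what it directly shows, how a business might interpret it, and where that interpretation stops.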

Paper result	What it directly shows	Business interpretation	Boundary
VLMs can predict task completion across three CUA benchmarks	Model-based auditing is feasible using instruction plus final GUI state	AI auditors can reduce manual QA load for some desktop automation tasks	Benchmark labels are not the same as live enterprise ground truth
Accuracy drops in Windows Agent Arena and OSWorld	Audit difficulty is environment-dependent	Auditors need environment-specific validation before deployment	The paper does not isolate every cause of degradation
Brier scores differ across models	Confidence reliability is a separate dimension from accuracy	Confidence can drive escalation, but only if calibrated	Confidence is self-reported through prompting, not uniformly derived from model log probabilities
Cohen’s $\kappa$ varies across model pairs and benchmarks	Auditors can disagree substantially	Disagreement can flag ambiguity and insufficient observability	Disagreement does not automatically identify which auditor is correct
Auditors see final state only	Scalable evaluation is possible with minimal evidence	Useful for lightweight audit layers	Not enough for hidden side effects, safety, privacy, or policy compliance

This table is the practical center of the article. The paper is not saying “VLM auditors are unreliable.” It is saying they are conditionally useful and that the conditions must be measured.

For automation teams, the relevant question is no longer whether AI can judge AI. It can, sometimes, fairly well. The better question is whether the judgment is strong enough for the consequence attached to it.

A failed calendar update and a failed wire transfer are both task failures. They do not deserve the same audit threshold. Apparently, civilization still requires context. Tragic, but manageable.

The final screenshot is a useful shortcut, not a complete audit trail

The paper’s main limitation is also the key design trade-off: auditors observe only the instruction and the final GUI state.

That makes the method scalable. It also makes it blind to anything not visible at the end. The authors explicitly note that many computer-use tasks involve hidden system state, background effects, or transient interface changes. This matters because GUI automation often produces outputs whose real success condition exists outside the screen.

A screenshot can show that a button was clicked. It may not show whether the server accepted the request.

A screenshot can show that a file is open. It may not show whether the latest version was saved.

A screenshot can show that a field contains text. It may not show whether the form passed validation after submission.

The calibration analysis has its own boundary. The study evaluates model-reported confidence elicited through standardized prompting. That is practically useful, because product teams often have access to reported confidence-like outputs or can ask models to express uncertainty. But it is not the same as intrinsic probabilistic calibration from token-level probabilities across all models.
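
In practice, prompted confidence arrives as text that has to be parsed and sanity-checked before anything is routed on it. The sketch below assumes a hypothetical JSON reply format and prompt wording; neither is the paper's actual prompt, and the screenshot is assumed to be attached through whatever model API is in use.

```python
import json

# Hypothetical prompt wording; doubled braces keep the JSON example literal in str.format.
PROMPT_TEMPLATE = (
    "Instruction given to the agent: {instruction}\n"
    "Look at the attached final screenshot and reply with JSON only, for example "
    '{{"done": true, "confidence": 0.87}}, where confidence is the probability '
    "(0 to 1) that the task was completed."
)

def parse_auditor_reply(reply_text):
    """Return (done, confidence) from a JSON reply, or None if the reply is unusable."""
    try:
        payload = json.loads(reply_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    done = payload.get("done")
    confidence = payload.get("confidence")
    if not isinstance(done, bool):
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return done, float(confidence)

prompt = PROMPT_TEMPLATE.format(instruction="Export the report as PDF")

print(parse_auditor_reply('{"done": true, "confidence": 0.87}'))  # (True, 0.87)
print(parse_auditor_reply("I think it worked."))                  # None
print(parse_auditor_reply('{"done": false, "confidence": 1.4}'))  # None
```

Treating an unparseable or out-of-range reply as "no judgment" rather than guessing keeps a formatting failure from being mistaken for a confident verdict.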

Finally, the study focuses on binary task completion. It does not evaluate safety, privacy, policy compliance, harmful side effects, reversibility, user consent, or whether the agent completed the task in an acceptable way. A task can be completed and still be unacceptable. Anyone who has watched automation “solve” a problem by deleting the evidence will understand the distinction.

The real bottleneck is not action. It is trustworthy verification.

Computer-use agents promise a broader form of automation because they interact with software visually rather than through carefully maintained scripts. That is powerful precisely because it bypasses many integration barriers. But the same flexibility makes evaluation harder.

CUAAudit shows that VLMs can serve as scalable auditors of these agents across real desktop benchmarks. The strongest models reach high accuracy in easier environments and maintain better calibration than open-source alternatives. That is encouraging.

But the paper’s more important message is colder: audit outputs are uncertain signals, not verdicts from a machine oracle. Accuracy falls under environment shift. Confidence can mislead. Strong auditors disagree. Final screenshots omit hidden state. And binary completion is only one slice of responsible deployment.

For business automation, the implication is straightforward. Do not build CUA governance around a single “AI judge.” Build it around a measured audit process: environment-specific validation, calibrated confidence thresholds, disagreement-based escalation, richer evidence collection, and human review where the consequence justifies it.

The future of AI automation will not be secured by agents that confidently say they are done.

It will be secured by systems that know when “done” is still an open question.

Cognaptus: Automate the Present, Incubate the Future.

¹ Marta Sumyk and Oleksandr Kosovan, “CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents,” arXiv:2603.10577v2, 2026. https://arxiv.org/abs/2603.10577
