Themis Knows Best: When AI Judges Start Training Other AI

Click.

The button moved. The page refreshed. A popup appeared, then disappeared. The agent says the task is done. The screenshot looks plausible. The log is long enough to impress a project manager and confusing enough to defeat a reviewer with a normal human attention span.

Now comes the awkward question: should the agent be rewarded?

That question sounds small, but it is quickly becoming one of the harder bottlenecks in computer-use AI. A GUI agent can be trained to click, type, scroll, search, submit, and recover from mistakes. But reinforcement learning only works if the reward signal tells the truth often enough. Reward the wrong trajectory, and the model learns the wrong habit with statistical confidence. Very sophisticated nonsense, in other words.

The paper behind OS-Themis tackles this exact problem: how to build scalable reward critics for GUI agents without hand-coding brittle rules for every app, website, and operating system.¹ Its answer is not “use a bigger judge model.” That would be too easy, and therefore suspicious. The paper’s more useful idea is that GUI rewards need an evidence structure: identify task-critical milestones, verify observable state changes, audit the evidence chain, and only then issue the final binary reward.

That mechanism is the real story. The benchmark gains matter, but they matter because they show that evidence engineering can change the quality of learning signals.

The naive judge sees a long movie and gives a thumbs-up

A GUI trajectory is not a clean mathematical proof. It is a noisy movie: screenshots, internal thoughts, low-level actions, UI transitions, occasional error messages, accidental recoveries, and final-state ambiguity.

Existing GUI reward approaches tend to fall into three broad camps.

| Reward approach | What it does well | Where it breaks |
| --- | --- | --- |
| Rule-based rewards | High precision when the environment exposes reliable success checks | Poor scalability across apps, websites, and OS interfaces; vulnerable to reward hacking if rules are too narrow |
| Trained critic models | Can learn domain-specific judgment from feedback or expert data | Expensive data construction; weaker generalization to new platforms or task distributions |
| LLM/VLM-as-a-judge | Scalable and flexible; can reason over screenshots and instructions | Easily loses decisive evidence in long trajectories or over-accepts plausible-looking partial success |

The tempting solution is to ask a strong vision-language model to inspect the whole trajectory and decide whether the task succeeded. Unfortunately, long GUI tasks create two opposite failure modes.

Sparse evaluation, such as checking only the final screenshot, loses context. A final page might look correct even if the agent skipped a required condition, submitted the wrong value earlier, or arrived at the state by accident.

Full-trajectory evaluation creates a different problem: evidence dilution. The judge sees many harmless or locally correct actions, and these “minor wins” can mask a single outcome-determining failure. The model remembers that the agent did many reasonable things and politely forgets that it did not complete the task. AI evaluators, like junior analysts, are vulnerable to long documents that contain one fatal footnote.

OS-Themis is designed around this observation. The issue is not merely judgment. The issue is deciding what counts as evidence.

OS-Themis turns trajectory judgment into an audited evidence workflow

OS-Themis divides reward judgment into two modules and four agent roles. The division matters because each role handles a different cognitive burden.

| Module | Agent | Role in the paper | Operational consequence |
| --- | --- | --- | --- |
| Milestone Verification Module | Selector | Selects key trajectory steps and assigns concrete observable assessment goals | Reduces evidence dilution by focusing evaluation on task-critical transitions |
| Milestone Verification Module | Verifier | Checks whether each selected milestone was achieved using local before/after evidence | Grounds judgment in visible UI state changes, not in the agent’s self-description |
| Verdict Calibration Module | Reviewer | Audits whether milestones are complete, rigorous, and evidence-grounded | Raises missing checks, weak criteria, and overlooked failure modes |
| Verdict Calibration Module | Judge | Produces the final binary reward using the full deliberation history | Converts structured evidence into an outcome-level reward without blindly aggregating milestones |

This is not bureaucracy for its own sake. It is controlled friction.

The Selector does not ask, “Did the whole task succeed?” It asks a narrower question: which steps are necessary to know whether the task succeeded? For a photo-taking task, that might include reaching the camera preview and actually capturing the photo. For a settings task, it might include reaching the right menu, changing the correct toggle, and confirming that the new state persists.

The Verifier then checks those milestones locally. It compares the relevant state before and after an action and looks for observable proof: text changed, a setting switched, an item disappeared, a dialog confirmed, a page header changed to match the expected screen. Crucially, the Verifier is instructed not to treat the agent’s reasoning as proof. The pixels matter. The agent’s charming internal monologue does not.

The Reviewer plays the uncomfortable but necessary role of critic. It asks whether the selected milestones are sufficient, whether the criteria are too lenient, whether a final-state persistence check is missing, or whether the evidence relies too much on action descriptions rather than visible UI state. This is where OS-Themis becomes more than step checking. It audits the evidence chain before reward assignment.

Finally, the Judge produces the binary reward. The paper is explicit that the Judge does not simply count verified milestones. That is important. A task can contain failed intermediate actions that are later corrected. Conversely, a trajectory can pass several intermediate checks and still miss the actual final requirement. Outcome reward is not the same as step-level neatness.

The mechanism can be summarized simply:

Raw GUI trajectory
  -> select task-critical milestones
  -> verify each milestone against observable UI evidence
  -> review whether the evidence chain is complete and rigorous
  -> judge final task success from the full deliberation history
  -> return binary reward for training or filtering

The difference from a normal LLM judge is therefore not cosmetic. OS-Themis makes the judge work from curated, reviewed evidence rather than from an undifferentiated trajectory blob.
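As a rough illustration of that workflow, here is a minimal Python sketch. The four role functions and the dict-based step format are hypothetical stand-ins for the VLM calls described in the paper, with toy rule-based bodies so the control flow can actually run. It is not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Milestone:
    step: int               # index of the trajectory step under review
    goal: str               # concrete, observable assessment goal
    verified: bool = False


def select_milestones(trajectory):
    # Toy Selector: keep only the steps marked as task-critical.
    return [Milestone(i, s["goal"]) for i, s in enumerate(trajectory) if s.get("critical")]


def verify(trajectory, milestone):
    # Toy Verifier: look at the observed UI state, never at the agent's claim.
    step = trajectory[milestone.step]
    milestone.verified = step["after"] == step["expected"]
    return milestone


def review(trajectory, milestones):
    # Toy Reviewer: object if no milestone checks the final step.
    covered = {m.step for m in milestones}
    return [] if len(trajectory) - 1 in covered else ["final state is not covered"]


def judge(milestones, objections):
    # Toy Judge: binary reward from the final milestone and the objections.
    return int(not objections and milestones[-1].verified)


def reward(trajectory):
    milestones = [verify(trajectory, m) for m in select_milestones(trajectory)]
    return judge(milestones, review(trajectory, milestones))


# Hypothetical trajectory: the agent claims success, the screen disagrees.
trajectory = [
    {"critical": False, "after": "settings page"},
    {"critical": True, "goal": "Wi-Fi toggle shows ON", "claim": "I enabled Wi-Fi",
     "after": "toggle: OFF", "expected": "toggle: ON"},
]
print(reward(trajectory))
```

The `claim` field is deliberately never read. In this toy version, only the observed state can earn a reward.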

The paper’s central bet is precision, not maximum coverage

A casual reader may assume the goal of a reward critic is to recognize as many successful trajectories as possible. The OS-Themis paper takes a sharper position: in reinforcement learning, false positives are especially dangerous.

If a failed trajectory is mistakenly rewarded, the policy update is contaminated. The model is not merely wasting a training example; it is being pushed toward behavior that should have been discouraged. Missing some successful trajectories is also costly, because learning becomes slower and positive signal becomes sparse. But once recall is adequate, the paper argues that improving precision can matter more than maximizing recall.

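The appendix formalizes this with a simplified policy-gradient argument. Let $\rho$ represent recall, the probability that a truly successful trajectory is rewarded, and let $\alpha$ represent the false-positive rate, the probability that a failed trajectory is rewarded anyway. The useful learning signal depends on the preference margin: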

$$ \rho - \alpha $$

The intuition is clean: the evaluator must distinguish successful trajectories from failed ones. If false positives rise, bad trajectories receive reward and the gap between good and bad behavior shrinks. When the gap shrinks far enough, the reward becomes a noisy cheerleader instead of a training signal.
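To make the margin concrete, the sketch below recovers an approximate false-positive rate from precision, recall, and class counts, then compares two rows from the ablation table reported later in this article. It assumes those rows were measured on the full OGRBench split (700 positive, 709 negative) and works from rounded published figures, so the outputs are estimates rather than numbers from the paper. The function names are my own.

```python
def false_positive_rate(precision, recall, n_pos, n_neg):
    # Recover the false-positive rate from precision, recall, and class counts.
    true_pos = recall * n_pos
    false_pos = true_pos * (1 - precision) / precision
    return false_pos / n_neg


def preference_margin(precision, recall, n_pos, n_neg):
    # rho - alpha: recall minus false-positive rate.
    return recall - false_positive_rate(precision, recall, n_pos, n_neg)


# Assumed class counts (OGRBench) and two ablation rows quoted in this article.
N_POS, N_NEG = 700, 709
full = preference_margin(0.928, 0.823, N_POS, N_NEG)
no_selector = preference_margin(0.797, 0.889, N_POS, N_NEG)
print(f"full: {full:.3f}, without Selector: {no_selector:.3f}")
```

Under these assumptions the full system lands near 0.76 and the no-Selector variant near 0.67. The variant with higher recall ends up with the smaller margin, which is the paper's argument in miniature.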

This explains why OS-Themis is intentionally conservative in several places. The Reviewer is configured as a critic rather than merely an advisor. Assignment goals are made concrete so the Verifier cannot approve vague progress. Test-time voting can be tuned toward precision or recall depending on use case. The system is not trying to be universally generous. It is trying to avoid teaching the agent that “almost right” deserves applause.
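The voting knob can be pictured as a simple threshold over repeated critic runs. This article does not spell out the paper's exact voting rule, so the `min_yes` threshold below is an assumption used only to show how the same votes can be read strictly or leniently.

```python
def vote(verdicts, min_yes):
    # Accept only if at least `min_yes` independent critic runs said "success".
    return int(sum(verdicts) >= min_yes)


runs = [1, 1, 0]                  # three hypothetical critic runs on one trajectory
strict = vote(runs, min_yes=3)    # unanimous: favors precision
majority = vote(runs, min_yes=2)  # majority: a middle setting
lenient = vote(runs, min_yes=1)   # any yes: favors recall
print(strict, majority, lenient)
```

For reward generation in RL, the stricter settings match the article's point that a missed success costs less than a rewarded failure.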

That design choice is strategically important for business users. In enterprise automation, the expensive failures are often false positives: the workflow appears complete, the system records success, and nobody notices until a customer complains, a compliance field is wrong, or a downstream process breaks. Missed positives are inefficient. False positives are operational debt with a friendly status icon.

Milestones are the paper’s practical compression layer

The milestone idea is not just a conceptual convenience. The appendix gives useful scale evidence.

Across OGRBench, the paper reports 1,409 tasks, 27,882 total trajectory steps, and 9,918 selected milestones under Qwen3-VL-235B. That means milestones account for 35.57% of all steps at the step level. The average task has 7.04 milestones, with a median of 6.00. The task-level average milestone percentage is higher, 58.47%, because shorter tasks naturally have fewer irrelevant steps.

| Milestone statistic | Reported value | Interpretation |
| --- | --- | --- |
| Total tasks | 1,409 | Same scale as OGRBench trajectory count |
| Total trajectory steps | 27,882 | Raw evaluation context is large |
| Total milestones | 9,918 | Selected evidence is much smaller than full trace |
| Step-level milestone ratio | 35.57% | Roughly one-third of steps carry most evaluative value |
| Average milestones per task | 7.04 | Evaluation becomes a structured checklist, not a whole-movie reaction |
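The two percentages quoted above differ because one is a ratio of sums and the other is a mean of per-task ratios. The sketch below reproduces the step-level figures from the reported totals, then uses a two-task toy example (hypothetical numbers, not from the paper) to show why short tasks pull the task-level average upward.

```python
# Totals reported for OGRBench under Qwen3-VL-235B.
total_steps, total_milestones, total_tasks = 27_882, 9_918, 1_409

step_level_ratio = total_milestones / total_steps   # ratio of sums
avg_per_task = total_milestones / total_tasks

# Hypothetical mini-benchmark: (milestones, steps) for a short and a long task.
tasks = [(3, 4), (7, 40)]
ratio_of_sums = sum(m for m, _ in tasks) / sum(s for _, s in tasks)
mean_of_ratios = sum(m / s for m, s in tasks) / len(tasks)

print(f"{step_level_ratio:.2%}, {avg_per_task:.2f}")
print(f"{ratio_of_sums:.2%} vs {mean_of_ratios:.2%}")
```

In the toy case the long task dominates the ratio of sums, while the short task counts equally in the mean of ratios. The gap between 35.57% and 58.47% has the same shape.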

This is the operational insight: OS-Themis does not simply add more evaluation. It selectively spends evaluation on the steps most likely to determine the outcome.

That matters because GUI trajectories are filled with low-value actions: scrolling, navigating, opening menus, dismissing popups, moving between pages, and recovering from harmless errors. Evaluating every action creates cost and noise. Evaluating only the end state creates blind spots. Milestones give the critic a middle path: enough context to avoid final-frame hallucination, enough compression to avoid drowning in irrelevant local successes.

For businesses, this resembles process audit design. You do not audit every keystroke in an invoice workflow. You audit the control points: vendor identity, invoice amount, approval status, tax code, payment destination, and final submission state. OS-Themis applies the same logic to GUI agents, except the auditor is a vision-language critic and the evidence is screenshots.

OGRBench tests whether the critic generalizes across GUI worlds

The authors also introduce OmniGUIRewardBench, or OGRBench, because GUI reward evaluation itself lacks a broad benchmark. The dataset contains 1,409 trajectories: 700 positive and 709 negative. It spans Ubuntu, Android, Windows, macOS, and Web tasks, drawing from OSWorld, AndroidWorld, WindowsAgentArena, macOSArena, and WebArena-Lite-v2.

The benchmark is not just a leaderboard accessory. It is part of the argument. If a reward critic only works on one interface domain, it may be useful, but it is not a generalist GUI reward framework.

| Platform source | Positive | Negative | Note |
| --- | --- | --- | --- |
| OSWorld / Ubuntu | 393 | 348 | Largest component |
| AndroidWorld / Android | 98 | 90 | Used again for RL experiments |
| WindowsAgentArena / Windows | 94 | 119 | Desktop coverage |
| macOSArena / macOS | 16 | 61 | Imbalanced because current agents perform poorly on macOS tasks |
| WebArena-Lite-v2 / Web | 99 | 91 | Browser-based tasks |
| Total | 700 | 709 | Overall class balance is close to even |

On OGRBench, OS-Themis improves over DigiRL and ZeroGUI on every averaged metric. The authors report that, averaged across tested base models, OS-Themis reaches 81.6% accuracy, 90.9% precision, 70.4% recall, and 78.7% F1. ZeroGUI reaches 73.9% accuracy, 85.8% precision, 57.4% recall, and 65.3% F1. DigiRL is much weaker at 62.8% accuracy, 61.3% precision, 53.5% recall, and 52.5% F1.

| Framework average on OGRBench | Accuracy (%) | Precision (%) | Recall (%) | F1 (%) |
| --- | --- | --- | --- | --- |
| DigiRL | 62.8 | 61.3 | 53.5 | 52.5 |
| ZeroGUI | 73.9 | 85.8 | 57.4 | 65.3 |
| OS-Themis | 81.6 | 90.9 | 70.4 | 78.7 |

The best headline number is the Qwen3-VL-235B configuration under OS-Themis: 88.0% accuracy, 92.8% precision, 82.3% recall, and 87.2% F1. But the average table is more informative for interpretation. It suggests the framework improves the operating point of different evaluator backbones, not just one carefully chosen model.
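For readers who want to check the headline figure, F1 is the harmonic mean of precision and recall. The short sketch below applies the standard formula to the single-model configuration quoted above. The function name is my own.

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


best = f1(0.928, 0.823)         # Qwen3-VL-235B under OS-Themis
from_averages = f1(0.909, 0.704)  # computed from the averaged precision and recall
print(f"{best:.1%}, {from_averages:.1%}")
```

The single-model figure reproduces the reported 87.2%. Applying the formula to the averaged precision and recall gives about 79.3%, not the reported 78.7%. That gap is what one would expect if F1 was computed per model and then averaged, though the article does not state this.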

There is also a useful cautionary detail. Proprietary and open models behave differently under these frameworks. Some models in the table are very conservative, producing low recall because they prefer predicting failure. Gemini-3-Flash shows strong precision. Qwen3-VL models benefit more as model size increases. The paper’s own discussion suggests that stronger reasoning models can exploit OS-Themis’s structured evidence better.

So the result is not “framework beats model quality.” It is more precise: framework design and model capability interact. Better evidence organization gives capable models something better to reason over.

The online RL tests show reward quality becoming model performance

A benchmark critic is nice. A critic that improves training is more interesting.

The paper tests OS-Themis as the reward source for online reinforcement learning on AndroidWorld. The authors train Qwen3-VL-4B and Qwen3-VL-8B policy backbones using different reward sources, including SEAgent, ZeroGUI, and OS-Themis. Each row is trained independently from the same initialization for the given backbone.

| Policy backbone | Reward source | AndroidWorld accuracy (%) |
| --- | --- | --- |
| Qwen3-VL-4B | Baseline, no RL reward source | 45.3 |
| Qwen3-VL-4B | SEAgent | 47.8 |
| Qwen3-VL-4B | ZeroGUI with Qwen3-VL-235B | 46.1 |
| Qwen3-VL-4B | OS-Themis with Qwen3-VL-8B | 50.9 |
| Qwen3-VL-4B | OS-Themis with Qwen3-VL-235B | 51.3 |
| Qwen3-VL-8B | Baseline, no RL reward source | 47.6 |
| Qwen3-VL-8B | SEAgent | 50.0 |
| Qwen3-VL-8B | ZeroGUI with Qwen3-VL-235B | 51.7 |
| Qwen3-VL-8B | OS-Themis with Qwen3-VL-8B | 53.4 |
| Qwen3-VL-8B | OS-Themis with Qwen3-VL-235B | 54.7 |

The interpretation is straightforward but important. For Qwen3-VL-4B, OS-Themis with the Qwen3-VL-235B critic reaches 51.3% versus a 45.3% baseline, an absolute gain of 6.0 points. For Qwen3-VL-8B, it reaches 54.7% versus 47.6%, a 7.1-point gain. The larger policy gains slightly more in absolute terms.
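The same subtraction can be run over every row of the table to compare reward sources side by side. The dictionary layout and labels below are my own. The scores are the ones reported above.

```python
baseline = {"Qwen3-VL-4B": 45.3, "Qwen3-VL-8B": 47.6}
scores = {
    ("Qwen3-VL-4B", "SEAgent"): 47.8,
    ("Qwen3-VL-4B", "ZeroGUI (235B)"): 46.1,
    ("Qwen3-VL-4B", "OS-Themis (8B)"): 50.9,
    ("Qwen3-VL-4B", "OS-Themis (235B)"): 51.3,
    ("Qwen3-VL-8B", "SEAgent"): 50.0,
    ("Qwen3-VL-8B", "ZeroGUI (235B)"): 51.7,
    ("Qwen3-VL-8B", "OS-Themis (8B)"): 53.4,
    ("Qwen3-VL-8B", "OS-Themis (235B)"): 54.7,
}

# Absolute gain over the matching baseline, rounded to one decimal place.
gains = {key: round(score - baseline[key[0]], 1) for key, score in scores.items()}
for (policy, source), gain in gains.items():
    print(f"{policy:12s} {source:18s} +{gain}")
```

On both backbones, even the smaller Qwen3-VL-8B critic inside OS-Themis yields a larger gain than either baseline reward source.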

This does not prove that OS-Themis is the final answer to GUI RL. It does show that reward-critic quality can translate into downstream agent performance, not merely prettier evaluator metrics.

The scaling pilot pushes the point further. Using Qwen3-VL-235B inside OS-Themis to score trajectories, the authors instantiate 1,024 training tasks and validate across staged training scales. After scaling to 1,024 tasks, Qwen3-VL-4B reaches 55.6% on AndroidWorld, a 10.3-point improvement over baseline.

That result should be read as pilot evidence for feasibility, not as a complete scaling law. The paper itself says the scaling characterization remains constrained by infrastructure. Still, the direction is useful: once reward signals become sufficiently reliable, the training loop can absorb more self-generated interaction data without immediately collapsing into noise.

The ablations show why this is not just “four agents are better than one”

Multi-agent frameworks can become decorative very quickly. Add enough role names and any pipeline starts to look intelligent. The ablation studies are therefore important because they test whether the roles do real work.

The full OS-Themis system reaches 88.0% accuracy, 92.8% precision, and 82.3% recall in the reported ablation setting. Removing components changes the failure profile.

| Variant | Accuracy (%) | Precision (%) | Recall (%) | What the result suggests |
| --- | --- | --- | --- | --- |
| Full OS-Themis | 88.0 | 92.8 | 82.3 | Best precision with strong overall performance |
| Without Selector | 83.3 | 79.7 | 88.9 | Checking every step increases recall but dilutes decisive evidence and hurts precision |
| Without Verifier | 81.9 | 77.2 | 90.1 | Assuming selected milestones are correct makes the system too lenient |
| Without Reviewer | 86.9 | 85.7 | 88.4 | Less strict auditing raises recall but admits more false positives |
| Without Judge | 52.5 | 89.7 | 5.0 | Pure milestone correctness becomes absurdly conservative; outcome-level reasoning is necessary |

The Selector ablation is especially revealing. Removing the Selector does not mean less evidence. It means more evidence: every step is forwarded to the Verifier. Performance gets worse, especially precision. That supports the paper’s evidence-dilution thesis. The problem is not insufficient checking; it is checking the wrong things with equal seriousness.

The Verifier ablation shows the opposite danger. If selected milestones are treated as correct by default, the system becomes too trusting. Precision falls. Selection without grounded verification is just a fancy checklist filled in by optimism.

The Reviewer ablation clarifies the paper’s preference for strictness. Without the Reviewer, recall rises to 88.4 but precision falls to 85.7. With the Reviewer acting as a critic, precision rises to 92.8 while recall falls to 82.3. This is not an accident. It reflects the paper’s RL-oriented operating point: false positives are more damaging than moderate recall loss.

The Judge ablation is the funniest and the most useful. If the system marks a trajectory successful only when all intermediate milestones are correct, recall collapses to 5.0. That is what happens when a critic confuses a messy but successful workflow with a failed workflow. Real agents recover. Good reward systems must understand recovery.
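The contrast between the two aggregation styles can be shown in a few lines. In the sketch below, `strict_all` mirrors the "every milestone must pass" rule from the ablation. `outcome_aware` is a toy heuristic of my own that counts only the latest attempt at each goal. The paper's Judge is a model reasoning over the full deliberation history, not this rule.

```python
def strict_all(milestones):
    # "Without Judge"-style rule: success only if every milestone passed.
    return all(passed for _, passed in milestones)


def outcome_aware(milestones):
    # Toy stand-in for outcome-level reasoning: only the latest attempt
    # at each goal counts, so a corrected mistake is not held against the agent.
    latest = {}
    for goal, passed in milestones:
        latest[goal] = passed
    return all(latest.values())


# Hypothetical photo task: the first capture attempt fails, the retry works.
recovered = [("open camera", True), ("capture photo", False), ("capture photo", True)]
# Hypothetical photo task that never recovers.
failed = [("open camera", True), ("capture photo", False)]

print(strict_all(recovered), outcome_aware(recovered))
print(strict_all(failed), outcome_aware(failed))
```

The strict rule rejects the recovered trajectory even though the photo was taken. Applied at scale, that is how recall can collapse while precision stays high.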

Assignment goals are small details with large consequences

One of the more practical details in the paper is the “assignment goal.” When the Selector proposes a milestone, it also generates a concrete assessment goal for the Verifier. This tells the Verifier exactly what observable outcome to check.

That sounds mundane. It is not.

Without assignment goals, the Verifier becomes lenient. It may approve vague progress because it is not anchored to a precise success condition. In the ablation, OS-Themis without assignment goals reaches 86.9% accuracy, 84.6% precision, and 90.0% recall. With assignment goals, it reaches 88.0% accuracy, 92.8% precision, and 82.3% recall.

| Variant | Accuracy (%) | Precision (%) | Recall (%) |
| --- | --- | --- | --- |
| Without assignment goal | 86.9 | 84.6 | 90.0 |
| With assignment goal | 88.0 | 92.8 | 82.3 |

Again, the pattern repeats: stricter evidence specification sacrifices some recall but sharply improves precision. In business language, the system becomes less likely to certify bad work.

This is one of the clearest operational lessons from the paper. If an enterprise wants to evaluate AI workflows, it should not merely ask, “Did the task succeed?” It should define observable subgoals: the invoice amount matches source document X, the approval field shows status Y, the CRM record contains Z, the confirmation page displays transaction ID W. Vague evaluation prompts produce vague assurance.
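One way to picture observable subgoals is as named predicates over fields read from the screen. The sketch below is a hypothetical enterprise-style example. The field names, values, and the `AssignmentGoal` class are illustrative assumptions, and in the paper the Verifier is a vision-language model reading screenshots rather than a dictionary lookup.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AssignmentGoal:
    description: str
    check: Callable[[dict], bool]   # predicate over fields read from the UI


# Hypothetical goals for an invoice workflow.
goals = [
    AssignmentGoal("invoice amount matches source document",
                   lambda ui: ui.get("amount") == "1,250.00"),
    AssignmentGoal("approval field shows Approved",
                   lambda ui: ui.get("approval") == "Approved"),
    AssignmentGoal("confirmation page shows a transaction ID",
                   lambda ui: bool(ui.get("transaction_id"))),
]


def evaluate(goals, ui_state):
    # Return a per-goal pass/fail report instead of one vague verdict.
    return {g.description: g.check(ui_state) for g in goals}


observed = {"amount": "1,250.00", "approval": "Pending", "transaction_id": "TX-1"}
report = evaluate(goals, observed)
for description, passed in report.items():
    print("PASS" if passed else "FAIL", description)
```

A single "did it succeed?" prompt could easily wave this trajectory through. The per-goal report shows that the approval step is the one that failed.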

Yes, this is less glamorous than saying “agentic AI.” It is also more likely to survive contact with production.

Self-training needs a filter before it needs more data

The paper’s self-evolution experiment is directly relevant to companies dreaming of autonomous data flywheels.

The setup is simple in principle. The system generates tasks, executes them in a containerized Android environment, collects trajectories, filters the trajectories using different reward critics, and then fine-tunes Qwen3-VL backbones on the resulting data. The raw collected dataset contains 15,110 trajectories.

The result is blunt: fine-tuning on OS-Themis-filtered data improves performance, while using all unfiltered data can degrade performance. The paper reports gains of 6.9 points for Qwen3-VL-4B and 5.0 points for Qwen3-VL-8B over the respective baselines when using OS-Themis-filtered trajectories.

This is the business lesson many AI teams learn the expensive way: a data flywheel without quality control is not a flywheel. It is a centrifuge. It spins faster while separating your budget from your expectations.

OS-Themis’s role in that loop is not content generation. It is acceptance control. It decides which self-collected trajectories are clean enough to become training material. That makes the reward critic a data curation layer, not just an RL component.
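Acceptance control reduces to a filter placed between collection and fine-tuning. The sketch below assumes a critic is any callable that returns 1 or 0 for a trajectory. `toy_critic` and the trajectory fields are hypothetical placeholders for a real reward critic.

```python
def filter_trajectories(trajectories, critic):
    # Keep only trajectories the critic accepts; the rest never reach fine-tuning.
    return [t for t in trajectories if critic(t) == 1]


def toy_critic(trajectory):
    # Hypothetical critic: accept only if the observed final state matches the target.
    return int(trajectory["final_state"] == trajectory["target_state"])


collected = [
    {"id": "a", "final_state": "alarm set 07:00", "target_state": "alarm set 07:00"},
    {"id": "b", "final_state": "alarm set 19:00", "target_state": "alarm set 07:00"},
    {"id": "c", "final_state": "clock app open", "target_state": "alarm set 07:00"},
]

training_set = filter_trajectories(collected, toy_critic)
print([t["id"] for t in training_set])
```

Swapping `toy_critic` for a stricter or looser critic changes only what enters the training set, which is why the critic's precision directly shapes the quality of self-collected data.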

For enterprise GUI automation, this distinction matters. The same architecture can support several operational uses:

| Business use | What OS-Themis-like evaluation would do | What remains uncertain |
| --- | --- | --- |
| Agent training | Provide binary rewards for online RL from interaction trajectories | Whether the environment can be safely sandboxed and scaled |
| Self-training data curation | Filter successful trajectories before supervised fine-tuning | Whether the critic’s success definition matches enterprise policy |
| Workflow monitoring | Audit whether deployed agents completed UI tasks correctly | Latency and cost may be too high for every transaction |
| Failure diagnosis | Produce milestone-level evidence about where the task failed | Evidence quality depends on screenshots and visible UI state |
| Governance review | Keep an inspectable chain of evaluation decisions | VLM-based judgment is still probabilistic and may need human review |

The business value is therefore not just better agent scores. It is cheaper diagnosis and cleaner data. Those are less exciting claims, but far more deployable.

Cost and latency are not footnotes when the evaluator watches screenshots

The paper includes a cost and latency analysis that should not be skipped. On OGRBench, OS-Themis averages 117.6 seconds of latency per trajectory, 164,624 prompt tokens, 6,416.8 completion tokens, and 14.1 calls.

| Per-trajectory metric on OGRBench | Reported value |
| --- | --- |
| Latency | 117.6 seconds |
| Prompt tokens | 164,624.0 |
| Completion tokens | 6,416.8 |
| Calls | 14.1 |

This is not lightweight. The authors argue that self-hosted Qwen3-VL models, prefix caching, and asynchronous evaluation can make the overhead manageable, especially because reward calculation is decoupled from environment rollouts. That is plausible for research infrastructure and batch training pipelines.
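A back-of-envelope calculation shows what those averages imply at benchmark scale. The sketch below multiplies the reported per-trajectory figures by the 1,409 OGRBench trajectories. It assumes strictly sequential evaluation with no caching or parallelism, which overstates the wall-clock time a real pipeline would need.

```python
# Reported per-trajectory averages on OGRBench.
LATENCY_S = 117.6
PROMPT_TOKENS = 164_624.0
COMPLETION_TOKENS = 6_416.8
CALLS = 14.1
N_TRAJECTORIES = 1_409

sequential_hours = LATENCY_S * N_TRAJECTORIES / 3600   # assumes no parallelism
total_prompt_tokens = PROMPT_TOKENS * N_TRAJECTORIES
total_completion_tokens = COMPLETION_TOKENS * N_TRAJECTORIES
prompt_tokens_per_call = PROMPT_TOKENS / CALLS

print(f"{sequential_hours:.1f} hours if run one trajectory at a time")
print(f"{total_prompt_tokens / 1e6:.0f}M prompt tokens in total")
print(f"{prompt_tokens_per_call:,.0f} prompt tokens per call on average")
```

Prompt tokens outnumber completion tokens by roughly 25 to 1, which is why prefix caching is a plausible place to recover cost.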

For production business use, it creates a segmentation problem. OS-Themis-like critics are most attractive where evaluation quality is worth the cost: training data filtration, high-risk workflow certification, sampled audits, offline validation, and post-failure diagnosis. They are less obviously suitable for low-value, high-volume UI actions where every transaction must be judged instantly.

The practical design pattern may be tiered evaluation:

1. Cheap deterministic checks for easy success conditions.
2. Lightweight model checks for routine ambiguity.
3. OS-Themis-style audited evaluation for high-value, high-risk, or training-critical trajectories.
4. Human review for cases where the critic remains uncertain or the consequence of a false positive is unacceptable.

That is not a weakness of the paper. It is how serious systems are usually built. The expensive evaluator should not be asked to inspect every paperclip.
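The tiered pattern can be sketched as a small routing function. The task fields, tier names, and the 0.8 confidence threshold below are all hypothetical choices for illustration. A real deployment would set them from its own risk policy.

```python
def choose_tier(task, critic_confidence=None, threshold=0.8):
    # Escalate to a human when a false positive is unacceptable
    # or when an earlier critic pass came back uncertain.
    if task.get("false_positive_unacceptable"):
        return "human_review"
    if critic_confidence is not None and critic_confidence < threshold:
        return "human_review"
    # Otherwise pick the cheapest tier that fits the task.
    if task.get("deterministic_check_available"):
        return "deterministic_check"
    if task.get("high_risk") or task.get("training_critical"):
        return "audited_critic"
    return "lightweight_model_check"


print(choose_tier({"deterministic_check_available": True}))
print(choose_tier({"training_critical": True}))
print(choose_tier({"high_risk": True}, critic_confidence=0.55))
print(choose_tier({}))
```

The ordering encodes the design choice in the list above: the expensive critic is reserved for cases where cheaper checks do not apply and the stakes justify the cost.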

The boundary conditions are as important as the benchmark gains

OS-Themis is powerful because it turns UI reward judgment into structured evidence evaluation. It is also bounded by the quality of the evidence and the deployment environment.

First, the critic depends on observable UI state. If the decisive condition is hidden in backend state, future processing, an email sent later, or a database update not visible in the screenshot, the visual evidence chain may be insufficient. The system can only verify what the trajectory exposes.

Second, the reward remains VLM-based. It is more structured than a one-shot judge, but it is still probabilistic semantic judgment. The paper explicitly raises the risk of semantic reward hacking: agents may learn behaviors that satisfy the critic’s interpretation without satisfying human intent.

Third, privacy is not optional. GUI trajectories contain screenshots, and screenshots contain real data. The paper argues that in-the-wild online training requires strict safeguards, local deployment, data sanitization, or removal of personally identifiable information before inference. For enterprise environments, this is not a nice-to-have. Screenshots of financial, HR, legal, or medical workflows are not harmless pixels.

Fourth, the scaling evidence is promising but not complete. The paper’s 1,024-task scaling pilot and AndroidWorld gains show feasibility, not a universal scaling law. The authors themselves note constraints around environment parallelism, coordination, hardware, and task initialization.

Fifth, OGRBench is broader than prior GUI ORM benchmarks, but not perfect. macOSArena is imbalanced because current agents have low success rates on macOS tasks. AgentRewardBench transfer also reveals lower recall due to class imbalance and distribution mismatch. These details do not invalidate the results. They prevent lazy overgeneralization, which is a public service.

What Cognaptus infers for business automation

The paper directly shows three things.

First, structured multi-agent criticism improves GUI outcome reward evaluation across a cross-platform benchmark. Second, better reward critics can improve online RL performance on AndroidWorld. Third, OS-Themis-filtered trajectories improve supervised fine-tuning, while unfiltered data can hurt.

Cognaptus’s business inference is narrower but useful: companies building GUI agents should treat evaluation architecture as a product component, not as an afterthought. A deployed computer-use agent needs at least three separate capabilities: execution, evidence capture, and reward/audit judgment. Most demos overinvest in the first and underinvest in the other two. That is why they look impressive until somebody asks whether the work was actually done.

The practical pathway looks like this:

Agent executes workflow
  -> system captures screenshots, actions, and relevant metadata
  -> milestone selector identifies decisive control points
  -> verifier checks visible state changes
  -> reviewer audits missing or weak evidence
  -> judge assigns success/failure/uncertain status
  -> result feeds training, monitoring, or human review

For internal automation, this can become a workflow assurance layer. For self-training, it becomes a data quality filter. For governance, it becomes an audit trail. For product teams, it becomes a way to compare agents not only by task success rate, but by whether their claimed success can be defended.

The uncertain part is ROI. OS-Themis-style evaluation is not free, not instant, and not deterministic. Its value depends on task risk, transaction volume, model hosting cost, screenshot sensitivity, and how often false positives currently escape detection. In some workflows, simple rules will be better. In others, rules will not scale and one-shot judges will be too sloppy. That middle zone is where OS-Themis becomes interesting.

The real contribution is disciplined distrust

The title says Themis knows best, but the paper’s deeper message is that Themis does not trust first impressions.

OS-Themis works because it refuses to treat GUI reward evaluation as a single act of judgment. It decomposes the task, demands observable milestones, checks local evidence, audits the chain, and then makes an outcome decision. That is not merely a model architecture. It is an epistemic posture: do not reward what you cannot verify.

This posture is likely to matter more as AI agents move from chat windows into software environments where actions have consequences. The next generation of GUI agents will not be limited by whether they can click buttons. They will be limited by whether training systems can tell good work from plausible work.

And plausible work is the dangerous kind. It looks complete. It logs success. It makes everyone comfortable until the invoice, ticket, order, or compliance record turns out to be wrong.

OS-Themis is not a final answer to agent evaluation. It is too expensive for many real-time settings, still dependent on VLM judgment, and not immune to privacy or reward-hacking risks. But it makes a strong case that the reward layer is becoming a first-class system design problem.

For businesses, that is the useful lesson. Do not ask only whether your AI agent can perform the task. Ask whether your system can prove it performed the task, identify where it failed, and avoid training future agents on beautiful mistakes.

The machines are learning from their judges now. We may want the judges to keep receipts.

Cognaptus: Automate the Present, Incubate the Future.

¹ Zehao Li et al., “OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards,” arXiv:2603.19191v1, 19 March 2026, https://arxiv.org/abs/2603.19191.
