TL;DR for operators

AI reliability is increasingly a path problem, not a score problem.

One paper argues that post-training methods such as supervised fine-tuning, reinforcement learning, and on-policy distillation should be understood by asking where supervision is applied in the model’s state space.[1] Another argues that GUI-agent software evaluation fails when a single unsuccessful rollout is treated as proof of a broken application, even though the evaluator has only inspected one path through a larger UI state graph.[2]

The shared lesson is blunt: the system’s behaviour is governed by the states it actually visits. Not the label on the training recipe. Not the benchmark headline. Not the judge’s first verdict. Those are comforting abstractions, which is exactly why they are dangerous.

For operators, this creates a more useful reliability checklist:

| Lifecycle stage | Wrong question | Better question |
| --- | --- | --- |
| Post-training | “Are we using SFT, RL, or distillation?” | “Which states receive supervision, and are they states the model actually visits?” |
| Evaluation | “Did the GUI agent pass or fail?” | “Which path did it inspect, and what alternative paths were probed?” |
| Governance | “Is the model good enough?” | “Where does evidence come from, and what uncertainty remains untested?” |

The conclusion is not that every business needs to build research-grade post-training pipelines or GUI-agent diagnostic protocols tomorrow morning. Calm down, procurement. The conclusion is that AI systems acting over time must be managed through state visitation, trajectory evidence, and controlled diagnosis. Otherwise, teams end up tuning the wrong thing and trusting the first plausible failure report.

Why this matters now

AI systems are becoming less like calculators and more like junior employees with browser access. They search, click, reason, retry, generate code, inspect interfaces, and leave behind logs that look authoritative enough to survive a management meeting. This creates a new reliability problem.

Static outputs are easy to judge badly. Sequential behaviour is easier to judge confidently and badly.

A chatbot answer can be wrong. An agent run can be wrong in more interesting ways. It may choose a bad path, misread a screen, click the wrong element, fail to recover, or declare the environment broken because it did not find the one button hiding slightly below the fold. Likewise, a model may look improved after post-training because it learned useful behaviour on the training examples, while quietly losing robustness on states it actually generates during deployment.

Both papers in this cluster attack that problem from different sides of the lifecycle. The post-training paper focuses on how a language model becomes specialised. The DiagEval paper focuses on how an agentic evaluator decides whether generated software works. They are not about the same benchmark, model, or method. That is precisely why they make a useful pair.

Together, they say: stop treating AI reliability as a fixed property of a model or an application. It is a path-dependent property.

The common frame: a model is a policy moving through states

The post-training paper gives the cleanest vocabulary for the shared idea. For an autoregressive language model, a “state” is not a mystical hidden vector. It is the prompt plus the generated prefix so far:

$$ s_t = (x, y_{<t}) $$

The model predicts the next token from that state. Over a full answer, it moves through a trajectory of such states. That framing sounds obvious, but it quietly rearranges the usual discussion of post-training.
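To make the vocabulary concrete, here is a minimal sketch that lists the states of one rollout, where each state is the prompt plus the tokens generated so far. The function name `rollout_states` and the toy tokens are illustrative assumptions, not anything taken from the paper.

```python
from typing import List, Tuple

# A state is (prompt x, generated prefix y_{<t}).
State = Tuple[str, Tuple[str, ...]]


def rollout_states(prompt: str, tokens: List[str]) -> List[State]:
    """Return the state the model sees before predicting each token."""
    return [(prompt, tuple(tokens[:t])) for t in range(len(tokens))]


# Hypothetical prompt and answer tokens, for illustration only.
states = rollout_states("2 + 3 =", ["5", ".", "<eos>"])
for t, (x, prefix) in enumerate(states):
    print(t, repr(x), prefix)
```

Three generated tokens mean three states, and the first one has an empty prefix. Everything the article says about "where supervision lands" is a statement about which of these tuples get touched.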

Most business discussions treat training methods as labels: SFT, RL, distillation, DPO, preference tuning, whatever acronym has escaped the lab this quarter. The paper argues that the label hides two separate choices:

| Dimension | Meaning | Why it matters |
| --- | --- | --- |
| State source | Where the training contexts come from | Determines where updates are applied |
| Signal source | What supervision is used | Determines what feedback the model receives |

SFT trains on fixed dataset states. RL trains on states sampled from the current policy. On-policy distillation, in the paper’s framing, lets the student generate the states while the teacher supplies local guidance.
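A toy sketch can separate the two choices. Below, an SFT-style setup takes its states from a fixed demonstration, while an on-policy-distillation-style setup takes its states from the learner and asks a teacher for a label at each one. The demonstration, `toy_student`, and `toy_teacher` are all hypothetical stand-ins; on-policy RL would reuse the same learner-generated states with a reward instead of a teacher label.

```python
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[str, ...]
Policy = Callable[[Prefix], str]

DEMO = ["first", "add", "the", "tens"]  # hypothetical clean demonstration


def dataset_states(demo: List[str]) -> List[Prefix]:
    """Off-policy state source: prefixes of a fixed demonstration."""
    return [tuple(demo[:t]) for t in range(len(demo))]


def learner_states(policy: Policy, steps: int) -> List[Prefix]:
    """On-policy state source: prefixes the learner generates itself."""
    prefix: Prefix = ()
    visited = []
    for _ in range(steps):
        visited.append(prefix)
        prefix = prefix + (policy(prefix),)
    return visited


def toy_student(prefix: Prefix) -> str:
    # Hypothetical learner that drifts from the demonstration after one token.
    return ["first", "guess", "the", "answer"][len(prefix)]


def toy_teacher(prefix: Prefix) -> str:
    # Hypothetical teacher: follows the demo unless the prefix has gone off-script.
    return "recheck" if "guess" in prefix else DEMO[len(prefix)]


# Same kind of token-level label, different state sources.
sft_targets: Dict[Prefix, str] = {s: DEMO[len(s)] for s in dataset_states(DEMO)}
opd_targets: Dict[Prefix, str] = {s: toy_teacher(s) for s in learner_states(toy_student, 4)}

shared = set(sft_targets) & set(opd_targets)
print(len(shared), "of", len(opd_targets), "learner states are also dataset states")
```

In this toy, only the first two learner states ever appear in the demonstration. The messy prefix containing `"guess"` receives guidance only in the learner-driven setup.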

That distinction matters because two methods can use similar-looking token-level signals while shaping different regions of behaviour. A supervised update on a pristine demonstration prefix is not the same operational event as a supervised update on a messy prefix produced by the learner itself. The token may be identical. The state is not. Anyone who has tried to train staff using idealised process manuals will recognise the pattern. Manuals are lovely. Reality has pop-ups.

The DiagEval paper uses a parallel idea for GUI-agent evaluation. It models software under test as a latent UI state-transition graph. A successful test means that some valid path reaches the target state. A failed rollout, however, only shows that the particular path attempted by the evaluator did not reach the target.

That is not the same as proving the target is unreachable.
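The gap between "this path failed" and "the target is unreachable" is easy to see on a toy graph. The sketch below assumes a small, fully known UI graph with made-up screen and action names; a real evaluator never sees the latent graph directly, which is rather the point.

```python
from collections import deque
from typing import Dict, List

# Hypothetical UI graph: screen -> {action: next screen}.
UI_GRAPH: Dict[str, Dict[str, str]] = {
    "home": {"click_cart": "cart", "open_menu": "menu"},
    "cart": {"click_pay": "payment_error"},
    "menu": {"click_checkout": "checkout"},
    "checkout": {"confirm": "order_placed"},
    "payment_error": {},
    "order_placed": {},
}


def run_path(start: str, actions: List[str]) -> str:
    """Follow one action sequence and return the last screen reached."""
    state = start
    for action in actions:
        if action not in UI_GRAPH[state]:
            break  # the action is not available on this screen
        state = UI_GRAPH[state][action]
    return state


def reachable(start: str, target: str) -> bool:
    """Breadth-first search: does any path lead from start to target?"""
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if state == target:
            return True
        for nxt in UI_GRAPH[state].values():
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


single_rollout_ok = run_path("home", ["click_cart", "click_pay"]) == "order_placed"
target_reachable = reachable("home", "order_placed")
print(single_rollout_ok, target_reachable)  # False True
```

The single rollout fails, yet the goal is reachable through the menu. A verdict based on the first line alone would blame the software for the evaluator's choice of route.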

So the two papers form a useful chain:

  1. During training, reliability depends on where supervision touches the model’s state space.
  2. During evaluation, reliability depends on which paths are inspected and how failures are diagnosed.
  3. During deployment, governance must treat states and trajectories as first-class evidence.

This is the article’s spine. Not “Paper A says X, Paper B says Y.” That would be the academic equivalent of reading two receipts aloud.

Training-side mechanism: post-training is not just about the loss

The post-training paper studies Qwen3-0.6B-Base in a controlled, small-scale setting. GSM8K is the target task; TruthfulQA and MMLU are used as retention checks. The authors compare mild SFT, stress SFT, on-policy distillation variants, and lightweight on-policy RL.

The most useful finding is not “RL good” or “SFT bad.” That would be too convenient, and therefore probably wrong.

The paper’s results are more careful:

| Method/result pattern | What the paper shows | What it does not prove |
| --- | --- | --- |
| Mild SFT | GSM8K improves with little retention loss | SFT is always safe |
| Stress SFT | Target performance and retention degrade | SFT is inherently destructive |
| OPD from degraded teacher | Student surpasses the degraded teacher on measured tasks | Any student can always beat any teacher |
| Lightweight on-policy RL | GSM8K improves while retention is largely preserved | RL is universally stable or cheap |
| Scalar drift metric | Similar MMD drift can correspond to different retention outcomes | Drift metrics are useless |

This is exactly the sort of nuance that tends to die in vendor decks.

The central interpretation is that supervision has an address. If dense supervised updates are applied repeatedly to narrow, fixed, off-policy trajectories, the model can be pushed into behaviour that does not align well with its own future rollouts. That is the stress SFT risk. But if supervision is applied on states induced by the learner, as in on-policy RL or continuation-based on-policy distillation, the update can be more local to the model’s actual behaviour.

The OPD result is especially interesting because it undermines the lazy view of distillation as “copy the teacher.” In the paper, the degraded teacher performs poorly as a global rollout policy. Yet the student can still benefit from the teacher’s local continuations when those continuations are requested from student-generated states. In plain terms: the teacher may be bad at driving its own car, but still useful when asked what to do at a junction the student actually reached.

That is a business-relevant distinction. Many organisations now fine-tune or distil models using synthetic data from larger systems. The usual question is whether the teacher is strong enough. This paper suggests a sharper question: are you copying the teacher’s trajectories, or querying useful guidance on the learner’s own states?

Those are different products wearing the same lab coat.

Evaluation-side mechanism: one failed rollout is not a verdict

DiagEval enters from the other end of the lifecycle. It is not trying to improve a model through post-training. It is trying to make GUI-agent evaluation of generated software more reliable after an initial failure.

The problem is structural. Interactive software correctness is graph-level: there may be many possible UI paths to a goal. A GUI evaluator observes a single trajectory. If that trajectory fails, the failure may come from the software, or from the evaluator’s own poor exploration, perception, reasoning, or execution.

The paper names this the single-trajectory identifiability gap. A failed path does not identify the cause of failure.

DiagEval responds by turning failure into diagnosis. Instead of blindly retrying from scratch, it reuses the failed trajectory, identifies a fork point, classifies the dominant source of uncertainty, generates targeted diagnostic probes, ranks them by expected information value, and updates an internal attribution signal. Importantly, the authors are careful that this score is not a calibrated posterior probability. It is an internal diagnostic signal. That caveat matters, because “Bayesian-ish score becomes executive truth” is how dashboards acquire religious authority.
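As a rough illustration of "rank probes by expected information value, then update an attribution signal", here is a toy sketch. It is not the paper's implementation: the entropy-based ranking is a swapped-in textbook technique, and the two hypotheses, probe names, and likelihood numbers are hand-set assumptions. As with the paper's score, the resulting number should be read as an internal signal, not a calibrated probability.

```python
import math
from dataclasses import dataclass


def entropy(p: float) -> float:
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


@dataclass
class Probe:
    name: str
    p_success_if_defect: float      # hypothetical: P(probe reaches goal | software defect)
    p_success_if_eval_error: float  # hypothetical: P(probe reaches goal | evaluator error)


def update(score: float, probe: Probe, success: bool) -> float:
    """Bayes-style update of the 'software defect' score after one probe outcome."""
    l_def = probe.p_success_if_defect if success else 1 - probe.p_success_if_defect
    l_err = probe.p_success_if_eval_error if success else 1 - probe.p_success_if_eval_error
    return l_def * score / (l_def * score + l_err * (1 - score))


def expected_info_gain(score: float, probe: Probe) -> float:
    """Expected entropy reduction of the score from running this probe."""
    p_succ = probe.p_success_if_defect * score + probe.p_success_if_eval_error * (1 - score)
    after = (p_succ * entropy(update(score, probe, True))
             + (1 - p_succ) * entropy(update(score, probe, False)))
    return entropy(score) - after


probes = [
    Probe("blind_retry", 0.05, 0.30),
    Probe("scroll_and_reinspect", 0.05, 0.50),
    Probe("alternative_route", 0.05, 0.80),
]
score = 0.5  # start undecided between defect and evaluator error
ranked = sorted(probes, key=lambda p: expected_info_gain(score, p), reverse=True)
print([p.name for p in ranked])
```

Under these made-up numbers the blind retry ranks last. That is the intuition behind preferring targeted probes: another roll of the same dice tells you little you did not already know.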

The protocol distinguishes several sources of uncertainty:

| Source of uncertainty | What can go wrong | Diagnostic probe type |
| --- | --- | --- |
| Imperfect grounding | The agent clicked or targeted the wrong UI element | Alternative transitions |
| Incomplete observation | Relevant state is hidden, off-screen, collapsed, or not exposed | Observation expansion |
| Reasoning hallucination | The agent misreads UI semantics or rationalises failure | Alternative transitions or reproducibility checks |
| Runtime instability | Network, rendering, backend, or flaky execution issues distort the run | Reproducibility tests |

The results show why this matters. On WebDevJudge-Unit and RealDevBench, DiagEval recovers a substantial share of false negatives and improves full-set accuracy relative to retry-style baselines. With Gemini 3 Flash Preview as the GUI-execution backbone, the paper reports improvements from 69.9% to 78.3% on WebDevJudge-Unit (a gain of 8.4 percentage points) and from 65.0% to 81.6% on RealDevBench (a gain of 16.6 points) after two diagnostic rounds. False-negative recovery reaches 45.6–62.1% depending on setting, higher than the retry-based baselines achieve.

The business point is not “use DiagEval exactly as written.” The business point is that failed agentic evaluation should not be treated as a terminal label. It should become a structured evidence-acquisition process.

A human QA lead already knows this instinctively. If a tester says, “The checkout flow is broken,” a competent lead asks: which browser, which account state, which cart, which click path, which error, did you try the alternative payment route, did you inspect the network call? DiagEval is essentially bringing that diagnostic discipline into GUI-agent evaluation. Charming that we had to rediscover QA after putting a robot in front of Chrome.

The combined logic chain

The two papers are most useful when read as a lifecycle argument.

| Step | Training-side version | Evaluation-side version | Operator lesson |
| --- | --- | --- | --- |
| 1. Treat behaviour as sequential | The model generates prefixes and acts from states | The GUI agent moves through UI states | Do not evaluate only final outputs |
| 2. Identify the state source | Dataset, teacher, student, or current policy | Initial failed path, fork state, probe branch | Ask where evidence and updates come from |
| 3. Avoid false certainty | Objective names do not explain outcomes alone | A failed rollout does not prove software failure | Beware confident labels from partial paths |
| 4. Add local correction | OPD and RL apply signals on learner-induced states | DiagEval probes alternative branches from failure points | Repair or diagnose where the system actually goes |
| 5. Govern the trajectory | Track retention, drift, and state locality | Track false negatives, true negatives, and attribution evidence | Build reliability controls around paths, not slogans |

This is the shared thesis: state visitation is the hidden variable.

Training and evaluation are often treated as separate worlds. In one world, teams argue about loss functions. In the other, teams argue about benchmark scores. These papers imply that both conversations are under-specified unless they say where the system was when feedback was applied or evidence was collected.

What the papers show versus what operators should infer

It is worth separating the evidence from the business interpretation.

The post-training paper shows, in a deliberately small setup, that state source helps explain why mild SFT can work, stress SFT can damage retention, OPD can outperform a degraded teacher, and lightweight on-policy RL can improve a target task with little measured retention loss. It also shows that scalar rollout drift, at least as measured in that setup, does not fully explain forgetting.

The DiagEval paper shows, in interactive software benchmarks, that targeted post-failure diagnostic probing can improve false-negative recovery and accuracy compared with retry-based evaluation. It also shows that attribution should be treated as structured evidence, not merely as a second roll of the same dice.

The business interpretation is broader but should remain disciplined:

  1. For model adaptation, method names are not enough. Ask whether updates are off-policy, on-policy, teacher-driven, learner-driven, dense, sparse, local, or global.
  2. For agent evaluation, first failures are weak evidence. Ask whether the evaluator explored enough of the reachable state space before blaming the software.
  3. For governance, logs are not automatically explanations. A log is one path through the system, not the system itself.
  4. For vendor assessment, benchmark gains need a state story. Did the model learn robust behaviour on likely deployment states, or did it get better at polished demonstration states?
  5. For QA cost control, smarter probing may beat brute-force retries. More attempts are not the same as more information.

This distinction matters because businesses love scorecards. Scorecards are clean. States are messy. Unfortunately, the messy thing is usually where the risk lives.

A practical framework: state-aware AI reliability

Operators do not need to reproduce the papers’ experiments to use the lesson. They need a better operating frame.

1. Map the states that matter

For an LLM workflow, states may include prompt types, partial reasoning traces, tool outputs, retrieved documents, user corrections, prior conversation context, and error-recovery prefixes.

For a GUI agent, states may include screenshots, DOM snapshots, visible controls, hidden panels, session state, account permissions, browser state, backend latency, and prior clicks.

The question is not “What is the average task?” The question is: which states does the system visit when it succeeds, when it fails, and when it half-succeeds while sounding confident?
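One low-tech way to start answering that question is to tally which states show up in failed runs but never in successful ones. The sketch below assumes run logs have already been reduced to coarse state labels; the labels and the log format are hypothetical.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, List, Tuple

# Hypothetical run logs: (outcome, state labels visited during the run).
runs: List[Tuple[str, List[str]]] = [
    ("success", ["login", "dashboard", "export_dialog"]),
    ("failure", ["login", "dashboard", "cookie_banner"]),
    ("failure", ["login", "cookie_banner"]),
    ("success", ["login", "dashboard", "export_dialog"]),
]


def visits_by_outcome(logs: List[Tuple[str, List[str]]]) -> DefaultDict[str, Counter]:
    """Count, per outcome, how many runs visited each state."""
    counts: DefaultDict[str, Counter] = defaultdict(Counter)
    for outcome, states in logs:
        counts[outcome].update(set(states))  # count each state once per run
    return counts


def failure_only_states(counts: DefaultDict[str, Counter]) -> List[str]:
    """States seen in failed runs and never in successful ones."""
    return sorted(set(counts["failure"]) - set(counts["success"]))


counts = visits_by_outcome(runs)
print(failure_only_states(counts))  # ['cookie_banner']
```

A state that appears only in failures is not proof of a defect. It is a candidate for the next probe, which is a more modest and more useful claim.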

2. Match supervision to deployment states

If the model will operate on messy user prompts, tool outputs, and partial self-generated plans, training only on ideal demonstrations is a polite form of denial. SFT may still be useful and efficient, as the post-training paper’s mild SFT result shows. But the risk grows when dense supervision over external trajectories pushes the model away from its own recoverable state distribution.

A practical adaptation plan should therefore specify:

| Control | Question |
| --- | --- |
| State sampling | Are training states drawn from real or simulated deployment rollouts? |
| Signal source | Is feedback from humans, teachers, rewards, tools, or verifiers? |
| Locality | Does the update touch states the model actually visits? |
| Retention checks | Which non-target capabilities are measured after adaptation? |
| Recovery behaviour | Is the model trained on its own errors or only on clean exemplars? |
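The locality question can be given a crude first number: what share of deployment visits land on states that also appear in the training set? The sketch below assumes states can be bucketed into exact-match labels, which is a heavy simplification of real prefixes and screens, and the labels themselves are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable


def visit_weighted_coverage(train_states: Iterable[Hashable],
                            deployment_visits: Iterable[Hashable]) -> float:
    """Fraction of deployment visits whose state label also appears in training."""
    train = set(train_states)
    visits = Counter(deployment_visits)
    total = sum(visits.values())
    if total == 0:
        return 0.0
    covered = sum(n for state, n in visits.items() if state in train)
    return covered / total


# Hypothetical state labels, for illustration only.
train_states = ["clean_prompt", "clean_plan", "clean_answer"]
deployment_visits = ["clean_prompt", "tool_error", "tool_error",
                     "partial_plan", "clean_answer"]

coverage = visit_weighted_coverage(train_states, deployment_visits)
print(f"{coverage:.0%}")  # 40%
```

A low figure does not say the adaptation failed. It says the supervision was mostly applied somewhere the deployed system rarely goes, which is worth knowing before reading the benchmark headline.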

3. Diagnose failures before escalating conclusions

For agentic evaluation, a failed run should trigger a diagnostic protocol, not an immediate funeral for the software.

A lightweight business version of DiagEval would ask:

| Diagnostic question | Why it matters |
| --- | --- |
| Did the agent click the intended element? | Separates UI defect from grounding error |
| Did the agent inspect hidden or off-screen states? | Separates absence from non-observation |
| Did an alternative route reach the same goal? | Tests reachability beyond one path |
| Is the failure reproducible under controlled repetition? | Separates stable defect from transient execution issue |
| Did probes preserve true negatives? | Prevents reckless conversion of all failures into passes |

The last point is important. A diagnostic system that flips too many failures into passes is not reliable. It is merely optimistic, which is a charming trait in children and a poor trait in software evaluators.
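Both sides of that trade-off can be tracked with two simple rates, given a labelled audit set. The definitions below are this sketch's assumptions and may differ from the paper's exact metrics: recovery is the share of wrongly failed cases that diagnosis flips to pass, and preservation is the share of correctly failed cases that stay failed. The records are invented.

```python
from typing import List, Optional, Tuple

# (truly_works, initial_verdict_pass, final_verdict_pass)
Record = Tuple[bool, bool, bool]


def diagnostic_rates(records: List[Record]) -> Tuple[Optional[float], Optional[float]]:
    """Return (false-negative recovery, true-negative preservation)."""
    false_negs = [r for r in records if r[0] and not r[1]]
    true_negs = [r for r in records if not r[0] and not r[1]]
    recovery = sum(r[2] for r in false_negs) / len(false_negs) if false_negs else None
    preservation = sum(not r[2] for r in true_negs) / len(true_negs) if true_negs else None
    return recovery, preservation


# Hypothetical audit set: every initial failure is re-examined by diagnosis.
records: List[Record] = [
    (True, False, True),    # working app, wrongly failed, recovered
    (True, False, True),
    (True, False, False),   # working app, still failed after diagnosis
    (True, False, True),
    (False, False, False),  # broken app, correctly kept as a failure
    (False, False, False),
    (False, False, True),   # broken app, wrongly flipped to pass
    (False, False, False),
    (True, True, True),     # passed first time, not part of either rate
]

recovery, preservation = diagnostic_rates(records)
print(recovery, preservation)  # 0.75 0.75
```

Reporting recovery without preservation is how an optimistic evaluator gets mistaken for a good one.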

4. Treat metrics as evidence summaries, not reality

The post-training paper’s scalar drift result is a useful warning. Stress SFT and OPD from the stress teacher can show nearly identical MMD drift while producing very different retention outcomes. In other words, one number can hide the difference between destructive movement and local repair.
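A toy calculation shows how a scalar drift number can be blind to direction. The sketch below uses a standard biased estimator of squared MMD with an RBF kernel on one-dimensional samples. How the paper computes its own MMD drift (features, kernel, bandwidth) is not assumed here, and the sample values are invented.

```python
import math
from typing import Sequence


def rbf(a: float, b: float, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2))


def mmd2(xs: Sequence[float], ys: Sequence[float], bandwidth: float = 1.0) -> float:
    """Biased estimate of squared MMD between two 1-D samples."""
    def mean_k(us: Sequence[float], vs: Sequence[float]) -> float:
        return sum(rbf(u, v, bandwidth) for u in us for v in vs) / (len(us) * len(vs))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2 * mean_k(xs, ys)


# Hypothetical 1-D "behaviour" samples: one reference, two opposite shifts.
base = [-1.0, 0.0, 1.0]
shift_up = [x + 2.0 for x in base]
shift_down = [x - 2.0 for x in base]

drift_up = mmd2(base, shift_up)
drift_down = mmd2(base, shift_down)
print(math.isclose(drift_up, drift_down))  # True
```

The two shifted samples sit on opposite sides of the reference, yet the drift number is the same. The metric reports how far the distribution moved, not where it went or whether the move was a repair.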

DiagEval makes a similar point from the evaluation side. A pass/fail verdict hides whether the evaluator found a valid path, missed a valid path, or encountered a genuine environment boundary. The final label is useful only when the evidence process behind it is understood.

Metrics should therefore be accompanied by state and trajectory metadata. Not as decorative telemetry. As part of the claim.
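In practice that can be as plain as refusing to store a bare pass/fail. The record below is a hypothetical schema, not one from either paper: it keeps the inspected path, the probes that were run, and whether the failure reproduced, and it summarises the verdict only in terms of that evidence.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VerdictRecord:
    task_id: str
    verdict: str                 # "pass" or "fail"
    path: List[str]              # states visited in the first rollout
    probes_run: List[str] = field(default_factory=list)
    reproduced: bool = False     # did the failure recur under controlled repetition?

    def evidence_summary(self) -> str:
        """Describe the verdict together with the evidence behind it."""
        if self.verdict == "pass":
            return "pass: a valid path was observed"
        if not self.probes_run:
            return "fail: single path only, cause unidentified"
        tag = "reproduced" if self.reproduced else "not reproduced"
        return f"fail: {len(self.probes_run)} probe(s) run, {tag}"


# Hypothetical task and states, for illustration only.
first_look = VerdictRecord("checkout-01", "fail", ["home", "cart", "payment_error"])
print(first_look.evidence_summary())

after_diagnosis = VerdictRecord(
    "checkout-01", "fail", ["home", "cart", "payment_error"],
    probes_run=["alternative_route", "repeat_same_path"], reproduced=True,
)
print(after_diagnosis.evidence_summary())
```

Both records say "fail". Only the second one says enough to act on.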

Where this stops

There are limits.

The post-training paper is intentionally small-scale: one base model, LoRA adapters, GSM8K as the target, TruthfulQA and a selected MMLU subset for retention, and lightweight drift estimates. It is best read as a mechanistic argument with suggestive evidence, not a universal benchmark verdict.

DiagEval is more applied, but its attribution machinery uses hand-tuned likelihood hyperparameters, prompt-implemented branch ranking, and a binary split between evaluator-side and environment-side failure. The paper itself notes that the attribution score is internal, not calibrated. That is a feature of the authors’ caution, not a flaw to be airbrushed away.

So the right takeaway is not “state-aware methods solve reliability.” They do not. The right takeaway is that reliability work ignoring states and trajectories is probably solving a simplified version of the wrong problem.

The operator’s version of the thesis

Here is the compressed form:

A sequential AI system is only as reliable as the states it learns from and the paths through which its behaviour is tested.

That thesis changes the management questions.

Do not ask only whether a model was fine-tuned. Ask what state distribution it was fine-tuned on.

Do not ask only whether the evaluator failed the app. Ask whether the evaluator gathered enough path evidence to distinguish a real defect from its own execution failure.

Do not ask only whether the benchmark went up. Ask what kinds of states were improved, what kinds were neglected, and which failure paths remain unprobed.

The serious work is no longer just selecting models. It is designing the conditions under which models learn, act, fail, and get diagnosed.

That sounds less glamorous than “deploying autonomous AI.” Good. Glamour is what happens before the incident report.

Notes

Cognaptus: Automate the Present, Incubate the Future.


  1. Dong Nie, “Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation,” arXiv:2605.22731, 2026.

  2. Sirui Hong, Zhijie Liu, Tengfei Li, Wei Tao, Yifan Wu, and Chenglin Wu, “DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents,” arXiv:2605.17439, 2026.