Phones are where automation goes to embarrass itself.

A desktop workflow can often be forced into a neat sequence: open tab, click menu, submit form, pretend the enterprise software was designed by someone who likes people. Mobile apps are less polite. They hide features behind drawers, gestures, modals, permissions, scrolling lists, bottom sheets, dark-pattern-ish confirmations, and the occasional button that looks decorative until it suddenly matters. A human user handles this with a mixture of visual attention, memory, muscle habit, and mild resentment. A mobile control agent has to do it with pixels, UI trees, and a policy that decides where the next finger should land.

That is the real context for DigiData, introduced in “DigiData: Training and Evaluating General-Purpose Mobile Control Agents” [1]. The paper is not just another “we collected a dataset” announcement, although, yes, it collected a dataset. The more interesting claim is mechanical: if we want agents that can operate mobile apps, the training and evaluation pipeline must be built around complete goals, app feature coverage, verified trajectories, and outcome-level judging. More screenshots alone will not save us. Neither will a metric that rewards an agent for tapping where a human once tapped, as if the universe owed every task a single correct route.

DigiData’s thesis is quietly practical: mobile agents need touch intelligence. Not just vision. Not just language. Not just imitation. They need to know how app functions are reached, when a task is actually complete, and why one route through the interface is equivalent to another. The finger matters because the finger is where intent becomes state change.

DigiData is a pipeline, not a pile of screenshots

The easiest way to misread this paper is to treat DigiData as a scale story. It has 152,000 trajectories, 8,275 goals, 26 Android apps, screenshots, UI trees, and synthetic chain-of-thought annotations. Fine. Add it to the growing mountain of agent datasets and move along.

That reading misses the point. DigiData’s contribution is less about raw volume than about how the data is manufactured. The paper’s data collection pipeline has three linked stages: goal curation, demonstration collection, and trajectory verification. Each stage attacks a different failure mode in mobile-agent training.

First, trained annotators explore app features deliberately. They are not merely recording whatever users happen to do. They are mapping app functionality and turning it into natural-language goals. This matters because mobile apps contain many useful functions that are buried several screens deep. If a dataset only captures surface interactions, the resulting agent learns surface behaviour. Very impressive, assuming the user’s ambition is to open the home screen and admire it.

Second, annotators demonstrate those goals on real physical Android devices and secure emulated environments. The demonstrations record the screen state and the action taken at each step. This turns goals into trajectories: sequences of taps, swipes, typing, navigation commands, and completion signals.

Third, trajectories are verified. The paper uses a combination of LLM-based judging and human review, filtering out trajectories that do not actually achieve their stated goals. The authors report that 5.3% of raw DigiData trajectories are removed by this process. They also compare against a sample of Android in the Wild (AitW), where human annotators judged only 84% of sampled trajectories as achieving the prescribed goal, versus 94.6% for DigiData before verification and 100% after verification.

That last number should not be read as metaphysical perfection. It means “accepted by the verification protocol,” not “guaranteed by the gods of Android.” Still, the operational point is important. A mobile-control dataset is only useful if the demonstrated task was actually completed. Otherwise, the model is trained on confident failure, which is popular in AI but rarely ideal.
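The verification stage can be pictured as a filter over raw demonstrations. The following is a minimal sketch, not the paper's implementation: the `Trajectory` record, the `llm_judge` and `human_review` callables, and the rule for combining them (the judge screens first, a human decides on anything the judge rejects) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Trajectory:
    goal: str
    steps: list[str]  # hypothetical: one text summary per recorded step


def verify(
    trajectories: Iterable[Trajectory],
    llm_judge: Callable[[Trajectory], bool],
    human_review: Callable[[Trajectory], bool],
) -> tuple[list[Trajectory], float]:
    """Return the accepted trajectories and the fraction that was removed."""
    kept, total = [], 0
    for traj in trajectories:
        total += 1
        # Assumed rule: the LLM judge screens first, and a human makes
        # the final call on anything the judge rejects.
        if llm_judge(traj) or human_review(traj):
            kept.append(traj)
    removed = 1 - len(kept) / total if total else 0.0
    return kept, removed


# Toy usage with stand-in judges (both are placeholders, not real models).
raw = [
    Trajectory("open settings", ["tap gear icon"]),
    Trajectory("delete a note", []),
    Trajectory("save a draft", ["type text", "tap save"]),
    Trajectory("share a photo", []),
]
kept, removed = verify(
    raw,
    llm_judge=lambda t: bool(t.steps),
    human_review=lambda t: t.goal == "delete a note",
)
print(len(kept), f"{removed:.0%}")
```

The removed fraction is the quantity the article quotes as 5.3% for the real dataset; the toy data here is invented and says nothing about that figure.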

The data is deeper because the goals are deeper

DigiData is smaller than Android in the Wild by trajectory count, but it is much deeper per app. That is the key trade.

| Dataset | Trajectories | Goals | Apps | Average trajectory length | Goals per app | Diversity score |
|---|---|---|---|---|---|---|
| Android in the Wild | 715k | 30,378 | 357 | 6.5 | 85 | 0.35 |
| AndroidControl | 15k | 14,548 | 833 | 5.5 | 17 | 0.43 |
| DigiData | 152k | 8,275 | 26 | 9.2 | 318 | 0.45 |

The business interpretation is straightforward: DigiData spends its annotation budget on depth of feature coverage rather than breadth of app count. It has fewer apps, but many more goals per app. Its average trajectory length is 9.2 steps, compared with 6.5 for Android in the Wild and 5.5 for AndroidControl. In the paper’s framing, that longer length is not an incidental inconvenience. It is evidence that the tasks reach further into app functionality.
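The "goals per app" column is just goals divided by apps, and it is worth recomputing to see how lopsided the trade is. A quick check, using only the goal and app counts from the table above:

```python
# Goal and app counts copied from the comparison table.
datasets = {
    "Android in the Wild": {"goals": 30_378, "apps": 357},
    "AndroidControl": {"goals": 14_548, "apps": 833},
    "DigiData": {"goals": 8_275, "apps": 26},
}

goals_per_app = {name: d["goals"] / d["apps"] for name, d in datasets.items()}

for name, value in goals_per_app.items():
    print(f"{name}: {value:.1f} goals per app")
```

The ratios come out at roughly 85, 17, and 318, matching the table once rounded to whole numbers.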

This is where the misconception starts to fall apart. Mobile agents are often discussed as if the central problem were visual grounding: can the model see the button and click it? That is necessary, but not sufficient. Many tasks fail long before the final tap. The agent must know which menu to search, whether a toggle already satisfies the goal, whether a typed query needs confirmation, whether a save action has happened, and when to stop.

The authors try to enrich this state-action mapping with multiple modalities. Each DigiData step includes the screenshot, the Android accessibility UI tree, and LLM-generated annotations: screen description, action description, rationale, and expected UI change. These annotations are synthetic, generated with Llama 4, so nobody should confuse them with privileged access to the model’s soul. Still, they give models in training a structured representation of what is happening before and after each action.

In practice, this turns each tap into a small causal claim: given this goal and this screen, this action should move the UI toward that outcome. That is more useful than “tap at coordinate 0.23, 0.76 and hope the gods are feeling deterministic.”
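One way to picture a single step is as a record with the fields the article lists. This is a sketch under assumptions: the class and field names, the file path, and the shape of the UI tree are hypothetical and are not the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepAnnotation:
    # The four LLM-generated fields named in the article.
    screen_description: str
    action_description: str
    rationale: str
    expected_ui_change: str


@dataclass(frozen=True)
class Step:
    goal: str
    screenshot_path: str  # hypothetical path to the screenshot
    ui_tree: dict         # accessibility tree; this shape is an assumption
    action: str           # the recorded action, shown here as plain text
    annotation: StepAnnotation


def as_causal_claim(step: Step) -> str:
    """Render a step as 'given goal and screen, action should cause change'."""
    a = step.annotation
    return (
        f"Goal: {step.goal}. Screen: {a.screen_description}. "
        f"Action: {a.action_description}. Expected: {a.expected_ui_change}."
    )


example = Step(
    goal="Turn on dark mode",
    screenshot_path="step_003.png",
    ui_tree={"class": "Switch", "text": "Dark mode", "checked": False},
    action="tap(0.23, 0.76)",
    annotation=StepAnnotation(
        screen_description="Display settings with a Dark mode switch that is off",
        action_description="Tap the Dark mode switch",
        rationale="The goal requires the switch to be on",
        expected_ui_change="The switch turns on and the theme darkens",
    ),
)
print(as_causal_claim(example))
```

The raw coordinate is still there, but it now travels with a statement of what it is supposed to accomplish.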

Chain-of-thought helps, but not by magic

The paper’s best-performing model is an 8B supervised fine-tuned Perception Language Model trained with chain-of-thought style output. It predicts not only the next parameterised action, but also descriptions of the current state, reasoning, action summary, and expected next state.

The result is meaningful but not miraculous. On DigiData-Bench, the 8B CoT model reaches 47.3% human-evaluated task success, compared with 42.1% for the non-CoT 8B model and 44.3% for the 3B model. Under GPT-4o LLM judging, the same 8B CoT model reaches 53.6%, compared with 48.5% for the non-CoT 8B model.

That is enough to say the synthetic reasoning traces are useful. It is not enough to say they solve mobile control. In fact, the category breakdown is more interesting than the headline. The 8B CoT model improves strongly on novel apps relative to the non-CoT 8B model: 36.7% versus 26.5%. It also improves on seen apps: 51.0% versus 45.2%. But on familiar apps, the CoT model is lower than the non-CoT 8B model: 42.6% versus 44.4%.

So the right reading is not “chain-of-thought makes agents reason.” The safer reading is: structured intermediate supervision can improve action policies and sometimes generalisation, but the effect is uneven. Synthetic CoT is a training scaffold, not a certificate of cognition. Charming, but still a scaffold.

DigiData-Bench measures outcomes, not choreography

DigiData-Bench is the paper’s second major contribution. It contains 309 goals across 37 Android apps and 8 app categories. The benchmark divides tasks into three novelty groups:

| Category | Meaning | What it tests |
|---|---|---|
| Seen | The app appears in DigiData training data | Whether training transfers to new goals inside known apps |
| Familiar | The app is new, but its category appears in training | Whether category-level patterns transfer |
| Novel | Neither the app nor its category appears in training | Whether the agent can handle out-of-distribution app types |

This split is useful because it prevents one of the more common evaluation sins: calling memorisation “generalisation” because it wore a clean shirt.
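The three-way split reduces to two membership checks. A minimal sketch, assuming the training apps and categories are available as sets; the app and category names below are invented placeholders, not the benchmark's real contents.

```python
def novelty_group(
    app: str,
    category: str,
    train_apps: set[str],
    train_categories: set[str],
) -> str:
    """Classify a benchmark app as 'seen', 'familiar', or 'novel'."""
    if app in train_apps:
        return "seen"
    if category in train_categories:
        return "familiar"
    return "novel"


# Hypothetical training coverage, for illustration only.
train_apps = {"NotesApp", "ShopApp"}
train_categories = {"productivity", "shopping"}

for app, category in [
    ("NotesApp", "productivity"),
    ("OtherShop", "shopping"),
    ("BirdWatcher", "wildlife"),
]:
    print(app, "->", novelty_group(app, category, train_apps, train_categories))
```

Checking the app before the category matters: a seen app always belongs to a seen category, so the order keeps the groups mutually exclusive.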

The benchmark supports human-assisted dynamic evaluation. A human operator initialises the app state, monitors the agent, blocks dangerous actions, and judges whether the final trajectory achieves the goal. The paper also defines DigiData-Bench-Auto, an AI-assisted evaluation protocol for the subset of goals that can be reproduced safely and automatically without manual state setup, login, location information, or risky side effects.

This matters because mobile interaction is stateful. A task like “delete an item,” “save a setting,” or “move the most recent photo into a new folder” is not captured by whether the agent matched a human’s next action at one screenshot. What matters is whether the app ends up in the right state.

The appendix gives a simple example: changing a shipping location to the Dominican Republic can be done through direct search, menu scrolling, or an alphabetic index. All three routes may be valid. Step accuracy, however, usually treats only one demonstrated path as correct. It punishes alternative success. Very academic. Very wrong.

Step accuracy is convenient, and that is the problem

The paper still reports step accuracy, because researchers need cheap proxies during model development. But its central empirical warning is that step accuracy can mis-rank agents.

| Model | DigiData-Bench step accuracy | Human-evaluated success rate |
|---|---|---|
| GPT-4o | 40.0% | 27.8% |
| Qwen2.5-VL | 49.2% | 39.2% |
| Ours 1B | 67.6% | 35.0% |
| Ours 3B | 70.7% | 44.3% |
| Ours 8B | 70.7% | 42.1% |
| Ours 8B CoT | 72.8% | 47.3% |

The mismatch is visible. The 1B model has much higher step accuracy than Qwen2.5-VL, but lower task success. The 3B and 8B models have identical DigiData-Bench step accuracy, yet different human-evaluated success. The 8B CoT model has the highest step accuracy and highest success rate among the paper’s models, but the relationship is not clean enough to trust as the main evaluation signal.
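The mismatch can be put into a single number by computing a rank correlation over the six rows of the table. This is a sketch using a hand-written Kendall tau-b (the variant that corrects for ties, since the 3B and 8B models share a step accuracy). It covers only these six rows, so it is not expected to reproduce the 0.72 that the article quotes later for step accuracy.

```python
from itertools import combinations
from math import sqrt

# (step accuracy %, human-evaluated success %) copied from the table above.
RESULTS = {
    "GPT-4o": (40.0, 27.8),
    "Qwen2.5-VL": (49.2, 39.2),
    "Ours 1B": (67.6, 35.0),
    "Ours 3B": (70.7, 44.3),
    "Ours 8B": (70.7, 42.1),
    "Ours 8B CoT": (72.8, 47.3),
}


def kendall_tau_b(pairs: list[tuple[float, float]]) -> float:
    """Kendall tau-b: (concordant - discordant) with a correction for ties."""
    concordant = discordant = tied_x = tied_y = total = 0
    for (x1, y1), (x2, y2) in combinations(pairs, 2):
        total += 1
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    return (concordant - discordant) / sqrt((total - tied_x) * (total - tied_y))


tau = kendall_tau_b(list(RESULTS.values()))
print(f"Kendall tau-b: {tau:.3f}")
```

On these rows the value is about 0.83: one pair is ranked the wrong way round (Qwen2.5-VL versus the 1B model) and one pair is tied on step accuracy despite differing in success. With only six models, a single swap is already a visible dent.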

The reason is not mysterious. Step accuracy asks: did the model choose the same next action as a human demonstrator at this point? Dynamic success asks: did the agent complete the task? In UI control, these are different questions. A good agent may take a different route. A bad agent may imitate a plausible step and then fail two screens later. A model can be a competent mime and a poor worker. Many organisations have met this employee already.

The paper’s evidence should push teams away from step-level imitation metrics as the primary success measure. They are useful for development loops. They are not sufficient for product readiness.

The judge is becoming part of the product stack

The paper’s LLM-judge results are not a side detail. They are part of the mechanism. If dynamic evaluation is necessary but human evaluation is expensive, then scalable mobile-agent development needs automated judges that can approximate human assessment.

DigiData’s LLM judge works in two stages. First, a step summarisation module converts each transition into text: what the screen looked like before, what the action did, what changed afterward. Then a judge model receives the goal, the initial and final states, and the transition summaries, and decides whether the goal was achieved.
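The two stages can be sketched as a summariser feeding a judge. This is a shape, not the paper's implementation: `Transition`, the prompt wording, and the YES/NO convention are assumptions, the summariser is a plain string template standing in for a model, and `judge_model` is any hypothetical callable that maps a prompt to a text reply.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Transition:
    before: str  # text description of the screen before the action
    action: str  # what the agent did
    after: str   # text description of the screen afterwards


def summarise(t: Transition) -> str:
    """Stage 1 stand-in: turn one transition into a line of text."""
    return f"Before: {t.before} | Action: {t.action} | After: {t.after}"


def judge_trajectory(
    goal: str,
    transitions: Sequence[Transition],
    judge_model: Callable[[str], str],
) -> bool:
    """Stage 2: ask the judge whether the goal was achieved."""
    if not transitions:
        raise ValueError("cannot judge an empty trajectory")
    lines = [
        f"Goal: {goal}",
        f"Initial state: {transitions[0].before}",
        f"Final state: {transitions[-1].after}",
        "Steps:",
    ]
    lines += [f"{i}. {summarise(t)}" for i, t in enumerate(transitions, start=1)]
    lines.append("Did the trajectory achieve the goal? Answer YES or NO.")
    reply = judge_model("\n".join(lines))
    return reply.strip().upper().startswith("YES")


# Toy usage with a fake judge that only looks for a phrase in the prompt.
steps = [Transition("Settings home", "tap Dark mode switch", "Dark mode is on")]
fake_judge = lambda prompt: "yes" if "Dark mode is on" in prompt else "no"
print(judge_trajectory("Enable dark mode", steps, fake_judge))
```

Keeping the judge behind a plain callable makes it easy to swap models, which is exactly the comparison the next table reports.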

The authors evaluate judges on a test set containing 229 successful and 327 failed model-generated trajectories. The comparison is revealing.

| Evaluation method | Accuracy | Precision | Recall | True negative rate | Kendall rank correlation with human judgement |
|---|---|---|---|---|---|
| Llama 4 Scout | 0.82 | 0.73 | 0.89 | 0.77 | 0.83 |
| Fine-tuned Llama 4 Scout | 0.87 | 0.88 | 0.79 | 0.92 | 0.89 |
| GPT-4o | 0.89 | 0.87 | 0.86 | 0.90 | 0.94 |
| Step accuracy | | | | | 0.72 |

This is not a declaration that LLM judges are truth machines. They are not. They are classifiers with errors, biases, and prompt dependence. But the comparison shows that judge-based dynamic evaluation tracks human model rankings better than step accuracy. GPT-4o has the highest rank correlation in the table. Fine-tuned Llama 4 Scout comes close, and its true negative rate is slightly higher than GPT-4o’s, which matters when the evaluation system must reject failed trajectories conservatively.
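The classifier columns are tied together by the test-set composition, which makes for a useful sanity check. The sketch below assumes that "successful" trajectories are the positive class, then derives accuracy and precision from the reported recall and true negative rate using the 229 and 327 counts. Because the inputs are already rounded to two decimals, the derived values should land close to the table rather than exactly on it.

```python
N_SUCCESS, N_FAIL = 229, 327  # test-set composition given in the article

# (recall, true negative rate) as reported in the table above.
JUDGES = {
    "Llama 4 Scout": (0.89, 0.77),
    "Fine-tuned Llama 4 Scout": (0.79, 0.92),
    "GPT-4o": (0.86, 0.90),
}


def implied_metrics(recall: float, tnr: float) -> tuple[float, float]:
    """Derive accuracy and precision from recall, TNR, and class counts."""
    tp = recall * N_SUCCESS   # successes the judge accepts
    tn = tnr * N_FAIL         # failures the judge rejects
    fp = N_FAIL - tn          # failures the judge wrongly accepts
    accuracy = (tp + tn) / (N_SUCCESS + N_FAIL)
    precision = tp / (tp + fp)
    return accuracy, precision


implied = {name: implied_metrics(*rates) for name, rates in JUDGES.items()}
for name, (accuracy, precision) in implied.items():
    print(f"{name}: accuracy ~{accuracy:.3f}, precision ~{precision:.3f}")
```

The derived figures agree with the reported accuracy and precision to within about 0.01. The arithmetic also shows why the true negative rate deserves attention: with more failures than successes in the set, every point of TNR moves precision more than a point of recall does.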

For business readers, the point is not “replace humans with LLM judges.” The point is more specific: once the task is represented as a trajectory with state transitions, evaluation itself can become semi-automated. That changes the economics of agent development. Teams can run more frequent tests, catch regressions earlier, and reserve expensive human review for calibration, edge cases, and safety-sensitive tasks.

In other words, the judge becomes part of the product stack. Not glamorous. Extremely useful. Like logging, but with more opinions.

How to read the evidence without overreading it

The paper includes several result types. They should not all be treated as equal proof of the same claim.

| Evidence | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Dataset comparison with AitW and AndroidControl | Main evidence for dataset positioning | DigiData is deeper per app, more verified, and more multimodal | That it covers the full mobile app universe |
| DigiData-Bench success rates | Main evidence for agent performance | DigiData-trained models improve task completion on the benchmark | That agents are ready for unsupervised deployment |
| Seen / familiar / novel split | Generalisation analysis | Performance drops sharply on novel app categories | That the exact same drop will occur in every business domain |
| CoT versus non-CoT model results | Ablation-like comparison | Synthetic reasoning traces can improve performance, especially in some splits | That generated CoT is faithful or universally beneficial |
| LLM judge metrics | Evaluation-method comparison | LLM judges are better aligned with human rankings than step accuracy | That LLM judges can replace human review in high-risk settings |
| Data scaling by app category | Sensitivity / scaling test | More data helps seen and familiar apps more than novel apps | That supervised fine-tuning alone will solve transfer |
| Task success versus step count | Exploratory complexity analysis | Longer tasks tend to reduce success rates | That step count fully measures task complexity |

The most strategically important line is the data-scaling result: adding more DigiData improves overall performance, especially on seen and familiar apps, but not significantly on novel apps. That is the paper’s most useful cold shower. The model gets better where the data gives it structural familiarity. It does not suddenly become a universal app explorer because someone added more examples.

The authors themselves suggest this may point to limitations of supervised fine-tuning and motivate future reinforcement-learning work. That is a reasonable inference. Supervised imitation teaches the agent what successful behaviour looked like in known distributions. It does not necessarily teach robust exploration in a foreign interface.

The business value is not “your phone runs itself tomorrow”

The obvious marketing version of this paper is easy to write: mobile agents are coming, your apps will operate themselves, humans will be liberated from tapping, the future will be frictionless, please enjoy the investor deck. It is also not what the evidence shows.

The more defensible business reading is narrower and better.

First, DigiData shows how to build domain-specific UI-control datasets when APIs are incomplete or unavailable. Many enterprise workflows still live inside apps and portals with poor integration surfaces. If the only available interface is the UI, then agent training must cover real features, not just common clicks. DigiData’s goal-curation protocol is a template: enumerate functions, generate natural goals, demonstrate workflows, verify outcomes.

Second, the paper gives product teams a better evaluation philosophy. A mobile agent should be measured by task completion under realistic state initialisation, not by next-action resemblance. This is directly relevant to QA automation, app regression testing, customer-support copilots, field-service workflows, and internal operations tools. If the task is “change this setting,” “retrieve this record,” or “file this request,” the metric should be whether the system state changed correctly.

Third, the LLM-judge architecture suggests a path to cheaper evaluation infrastructure. A company building app agents could combine automated trajectory collection, step summarisation, LLM judging, and human audits. That creates a practical evaluation loop: broad automated testing, targeted human review, and continuous judge calibration. Not perfect. Better than squinting at click accuracy and calling it productivity.

Fourth, the results clarify where deployment will be easiest. Agents are more plausible in bounded app families, known workflows, and repeated operational tasks. They are less plausible when thrown cold into arbitrary novel apps, ambiguous goals, or workflows with irreversible side effects. The machine can tap. Whether it should tap is still a governance question.

The boundaries are where the strategy lives

DigiData is an important paper because it is concrete. It is also bounded.

The dataset covers 26 Android apps. The benchmark covers 37 apps. That is substantial for research, but tiny compared with the real mobile ecosystem. The paper’s own novelty split shows why this matters: novel app performance remains weak. More examples help most where the agent has already seen the app or category structure.

The best reported human-evaluated agent success rate is 47.3%, while human experts reach 90.1%. That gap is not a footnote. It is the difference between a promising research system and something you would trust with open-ended user workflows. Imagine a mobile assistant that fails more than half the time but does so with excellent multimodal annotations. Fascinating. Still not your operations department.

The evaluation stack also has costs. Human-assisted dynamic evaluation requires setup, monitoring, unsafe-action blocking, and judgement. AI-assisted evaluation reduces that burden, but it only applies to goals that can be safely and reproducibly automated. LLM judges are useful, but imperfect. In regulated, financial, medical, or account-management contexts, “the judge thought it was fine” will not be a sufficient audit trail. Nor should it be.

There is also a subtle limitation around action space. The paper uses a unified action representation: tap, swipe, type, navigate, and status commands. That is useful for training across datasets. But real apps introduce timing delays, animations, permission flows, localisation differences, account states, and UI redesigns. A coordinate that works today may become a small act of comedy after the next app update.
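A unified action space of this kind can be written down as a small set of types. This is a sketch under assumptions: the class names, the use of normalised 0-to-1 coordinates, and the specific navigation and status values are illustrative and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass(frozen=True)
class Tap:
    x: float  # assumed normalised to [0, 1] across the screen width
    y: float  # assumed normalised to [0, 1] across the screen height


@dataclass(frozen=True)
class Swipe:
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass(frozen=True)
class Type:
    text: str


@dataclass(frozen=True)
class Navigate:
    target: Literal["back", "home", "enter"]  # hypothetical set of commands


@dataclass(frozen=True)
class Status:
    value: Literal["complete", "impossible"]  # hypothetical status signals


Action = Union[Tap, Swipe, Type, Navigate, Status]


def to_pixels(tap: Tap, width: int, height: int) -> tuple[int, int]:
    """Convert a normalised tap to pixel coordinates for a given screen."""
    return round(tap.x * width), round(tap.y * height)


tap = Tap(0.23, 0.76)
print(to_pixels(tap, 1080, 2400))  # a tall phone screen
print(to_pixels(tap, 720, 1280))   # a smaller screen
```

Normalised coordinates absorb a change of resolution, but nothing in this representation absorbs a redesign that moves the button, which is the limitation the paragraph above describes.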

These limitations do not weaken the paper. They define its practical use. DigiData is strongest as a blueprint for building and evaluating UI agents inside constrained domains. It is weaker as evidence that general mobile autonomy is imminent. Conveniently, the first claim is useful; the second is mostly conference coffee.

The deeper lesson: train the route, score the destination

The best way to understand DigiData is to separate training from evaluation.

Training still needs routes. Demonstrations show the agent how humans move through screens, how goals decompose into actions, and how interface state changes after each touch. DigiData improves those routes by curating deeper goals, verifying trajectories, and adding multimodal explanations.

Evaluation, however, must score the destination. A successful agent does not need to imitate one human route if another route achieves the same goal safely. DigiData-Bench’s dynamic evaluation makes that distinction explicit. That is the paper’s cleanest conceptual move.

For businesses, this is the useful rule:

| Design question | Weak answer | Better answer |
|---|---|---|
| What data should we collect? | Random user sessions | Feature-covered goals with verified completions |
| What should the model learn? | Next tap imitation | Goal-conditioned state transitions |
| What should we measure? | Step accuracy | Task success under realistic setup |
| How should we scale evaluation? | Manual review only | Automated judges plus calibrated human audits |
| Where should we deploy first? | Arbitrary apps | Bounded workflows with known side-effect rules |

This is not a small shift. It moves mobile automation from “record and replay” toward “goal and verify.” That is the difference between a macro and an agent. The industry has spent years dressing macros in agent clothing. DigiData, to its credit, gives the clothing a body to fit.

Touch intelligence is an operations problem

DigiData’s most valuable contribution is not that it teaches an AI to tap. Plenty of models can tap. Some can even tap with confidence, which is adorable until they buy something.

The contribution is that it treats mobile control as an operational loop: discover app features, formulate useful goals, collect demonstrations, verify success, train with multimodal state, evaluate dynamically, and automate part of the judging process. Each step closes a gap that simple imitation leaves open.

That is why the mechanism-first reading matters. If we only summarise the dataset size, we miss the machinery that makes the dataset useful. If we only report the benchmark score, we miss the evaluation argument. If we only celebrate CoT, we miss the much less glamorous and much more important fact: agents improve when their world is instrumented, their goals are meaningful, and their success is judged by outcomes.

The paper does not give us a general mobile agent ready to run everyone’s phone. It gives us a more serious recipe for building one. In this field, that counts as progress. Not cinematic progress. Engineering progress. The kind that arrives with runbooks, verification rubrics, and a healthy distrust of convenient metrics.

The future mobile agent will not merely see the screen. It will know what the screen is for, what state it needs to change, and when its work is done. DigiData is one step toward that kind of touch intelligence.

Cognaptus: Automate the Present, Incubate the Future.


1. Yuxuan Sun et al., “DigiData: Training and Evaluating General-Purpose Mobile Control Agents,” arXiv:2511.07413, 2025, https://arxiv.org/abs/2511.07413