Opening — Why this matters now
In 2025, AI agents are no longer confined to text boxes. They’re moving across screens—scrolling, tapping, and swiping their way through the digital world. Yet the dream of a truly general-purpose mobile control agent—an AI that can use your phone like you do—has remained out of reach. The problem isn’t just teaching machines to see buttons; it’s teaching them to understand intent.
Meta’s new contribution, DigiData, is a serious attempt to close that gap. It reframes mobile interaction not as a visual problem, but as a goal-oriented, multi-modal reasoning task—and it provides both the data and the benchmarks to prove it.
Background — The problem of digital dexterity
Previous efforts to train agents that “use” mobile apps have largely treated them as glorified macro recorders. Datasets like Android in the Wild and AndroidControl recorded human gestures but ignored the underlying purpose. They were shallow—lots of taps, few thoughts. The resulting agents could replicate clicks but not context.
DigiData starts from the opposite end: by curating goals rather than random user sessions. Annotators explored each app as if mapping its psychology—cataloguing features, designing goals that stretch beyond surface-level tasks (e.g., “sort flight options by cheapest refundable tickets”). The result is 8,275 unique goals across 26 apps, expressed through 152,000 verified trajectories and 1.38 million screenshots.
This matters because task diversity is what pushes agents from memorization toward generalization. Where older datasets were heavy on e-commerce or simple navigation, DigiData spans domains—productivity, communication, travel, media, and management. Each goal averages 9.2 human steps, 50% deeper than prior datasets. The agent must now plan, not just imitate.
Analysis — The dataset and its design logic
DigiData’s real innovation is its three-phase data pipeline, which combines human exploration, demonstration, and AI-verified curation:
| Phase | Description | Role |
|---|---|---|
| 1. Goal Curation | Human annotators exhaustively explore each app, defining realistic goals that span the app’s full functionality. | Ensures coverage and task diversity. |
| 2. Demonstration Collection | Annotators perform each goal on emulated or physical devices, recording screenshots and actions. | Produces grounded multi-modal trajectories. |
| 3. Trajectory Verification | LLM judges and humans cross-validate outcomes, rejecting ~5% of bad samples. | Guarantees quality and consistency. |
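Phase 3 is the most automatable part of the pipeline. Here is a minimal sketch of what LLM-assisted trajectory verification could look like; the record and judge types are hypothetical illustrations, not DigiData's actual schema:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trajectory:
    goal: str                 # natural-language goal given to the annotator
    final_screenshot: bytes   # last screen captured after the demonstration
    actions: list[dict]       # recorded taps, swipes, and text entries

# A "judge" inspects the goal and the final screen and answers whether the
# goal was actually accomplished. In DigiData this role is shared between
# LLM judges and human reviewers; here it is just a callable.
Judge = Callable[[str, bytes], bool]

def verify(trajectories: list[Trajectory], judge: Judge) -> list[Trajectory]:
    # Keep only demonstrations the judge accepts; per the table above,
    # roughly 5% of collected samples are rejected at this stage.
    return [t for t in trajectories if judge(t.goal, t.final_screenshot)]
```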
Beyond screenshots, DigiData logs UI trees and LLM-generated Chain-of-Thought (CoT) data at every step. These “explanatory traces” translate visual actions into human-readable reasoning—why the agent taps where it does, what it expects to happen next. In essence, the dataset teaches models how to explain themselves.
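To make that per-step logging concrete, here is a rough sketch of what one step record might contain; the field names are illustrative, not the dataset's real format:

```python
from dataclasses import dataclass

@dataclass
class Step:
    screenshot: bytes      # raw pixels the agent observes
    ui_tree: dict          # accessibility / view-hierarchy dump for the same screen
    chain_of_thought: str  # LLM-generated rationale: why act here, what should happen next
    action: dict           # e.g. {"type": "tap", "x": 540, "y": 1210}
                           # or  {"type": "type_text", "text": "refundable"}

@dataclass
class Episode:
    goal: str              # e.g. "sort flight options by cheapest refundable tickets"
    app: str               # one of the 26 apps covered by the dataset
    steps: list[Step]      # ~9.2 steps per goal on average
```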
Findings — Benchmarks that finally think
The companion benchmark, DigiData-Bench, is where the story gets empirical. It includes 309 goals across 37 apps, split into three novelty classes:
| Category | Definition | Purpose |
|---|---|---|
| Seen | Apps present in DigiData | Tests memorization |
| Familiar | Apps in similar categories | Tests transfer learning |
| Novel | Entirely unseen app types | Tests generalization |
Agents trained on DigiData were evaluated via both human-assisted and AI-assisted dynamic protocols—essentially, letting an agent perform live on devices while human or LLM judges score success. Static metrics like step accuracy (the fraction of predicted actions that match the ground-truth demonstration) proved misleading: agents with similar step-level precision diverged sharply in end-to-end success.
| Model | Step Accuracy (%) | DigiData-Bench Success Rate (%) |
|---|---|---|
| GPT‑4o | 40.0 | 27.8 |
| Qwen2.5‑VL | 49.2 | 39.2 |
| DigiData 8B (CoT) | 72.8 | 47.3 (human) / 53.6 (LLM) |
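The gap between those two columns is easy to express in code. A minimal sketch (my own formulation, not the benchmark's implementation) of why per-step imitation accuracy and end-to-end success diverge:

```python
def step_accuracy(predicted: list[dict], reference: list[dict]) -> float:
    # Fraction of predicted actions that match the human demonstration
    # step for step. A high score only means the agent imitates well.
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / max(len(reference), 1)

def success_rate(episode_outcomes: list[bool]) -> float:
    # Fraction of live episodes that a judge (human or LLM) marks as having
    # achieved the goal. A single wrong tap early in an episode can sink an
    # otherwise "accurate" run, which is why the two metrics diverge.
    return sum(episode_outcomes) / max(len(episode_outcomes), 1)
```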
Even the best model still falls far below human experts (≈90% success), but that’s not the point. DigiData’s layered evaluation shows where and why models fail: generalization collapses in novel apps, and reasoning consistency lags behind perception.
Implications — From mobile control to digital autonomy
DigiData’s lesson is philosophical as much as technical: interaction data needs structure, not scale. It demonstrates that true “digital dexterity” arises from goal comprehension, not gesture replication.
For enterprises, this opens a credible path to UI‑level automation. Instead of brittle screen recorders, we can imagine agents that adapt across app updates, reason about workflows, and perform compound operations (“book travel, file receipt, update calendar”) autonomously. In other words: RPA meets cognition.
At a research level, DigiData suggests three inflection points:
- Human‑AI hybrid supervision — pairing annotators with LLM judges to clean and validate massive datasets.
- Chain‑of‑Thought as policy trace — using CoT not just for language, but for sequential visual reasoning (a sketch follows this list).
- Dynamic evaluation protocols — benchmarking agents by outcomes, not just by imitation.
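The second point is the most concrete to illustrate. A sketch of what "CoT as policy trace" could look like at inference time, assuming a hypothetical multimodal model wrapper (the `vlm.generate` call and its parameters are placeholders, not a real API):

```python
import json

def agent_step(vlm, goal: str, screenshot: bytes, ui_tree: dict) -> tuple[str, dict]:
    """One step of a CoT-conditioned policy: the model reasons out loud
    before committing to a single UI action."""
    prompt = (
        f"Goal: {goal}\n"
        f"UI tree (truncated): {json.dumps(ui_tree)[:2000]}\n"
        "First explain, in one or two sentences, what the screen shows and what "
        "the next action should accomplish. Then, on the final line, output the "
        'action as JSON, e.g. {"type": "tap", "x": 540, "y": 1210}.'
    )
    # `vlm.generate` is a placeholder for whatever multimodal API is in use.
    response = vlm.generate(prompt=prompt, images=[screenshot])
    lines = response.strip().splitlines()
    reasoning = "\n".join(lines[:-1])   # the explanatory trace, kept for logging
    action = json.loads(lines[-1])      # the structured action actually executed
    return reasoning, action
```

Training on DigiData's CoT traces amounts to supervising both outputs of a function like this, the reasoning and the action, rather than the action alone.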
Conclusion — Toward agents that can touch, see, and think
Meta’s DigiData is less a dataset than a training ground for embodied intelligence. It treats every tap and swipe as part of a reasoning chain—a cognitive fingerprint of how humans interact with digital systems. The next leap will come when these agents stop replaying human behavior and start abstracting from it—learning not only what we do, but why.
Cognaptus: Automate the Present, Incubate the Future.