Opening — Why this matters now
In 2025, AI agents are no longer confined to text boxes. They’re moving across screens—scrolling, tapping, and swiping their way through the digital world. Yet the dream of a truly general-purpose mobile control agent—an AI that can use your phone like you do—has remained out of reach. The problem isn’t just teaching machines to see buttons; it’s teaching them to understand intent.
Meta’s new contribution, DigiData, is a serious attempt to close that gap. It reframes mobile interaction not as a visual problem, but as a goal-oriented, multi-modal reasoning task—and it provides both the data and the benchmarks to prove it.
Background — The problem of digital dexterity
Previous efforts to train agents that “use” mobile apps have largely treated them as glorified macro recorders. Datasets like Android in the Wild and AndroidControl recorded human gestures but ignored the underlying purpose. They were shallow—lots of taps, few thoughts. The resulting agents could replicate clicks but not context.
DigiData starts from the opposite end: by curating goals rather than random user sessions. Annotators explored each app as if mapping its psychology—cataloguing features, designing goals that stretch beyond surface-level tasks (e.g., “sort flight options by cheapest refundable tickets”). The result is 8,275 unique goals across 26 apps, expressed through 152,000 verified trajectories and 1.38 million screenshots.
This matters because task diversity is what pushes agents from memorization toward generalization. Where older datasets were heavy on e-commerce or simple navigation, DigiData spans domains—productivity, communication, travel, media, and management. Each goal averages 9.2 human steps, 50% deeper than prior datasets. The agent must now plan, not just imitate.
Analysis — The dataset and its design logic
DigiData’s real innovation is its three-phase data pipeline, which combines human exploration, demonstration, and AI-verified curation:
| Phase | Description | Role |
|---|---|---|
| 1. Goal Curation | Human annotators exhaustively explore each app, defining realistic goals that span the app’s full functionality. | Ensures coverage and task diversity. |
| 2. Demonstration Collection | Annotators perform each goal on emulated or physical devices, recording screenshots and actions. | Produces grounded multi-modal trajectories. |
| 3. Trajectory Verification | LLM judges and humans cross-validate outcomes, rejecting ~5% of bad samples. | Guarantees quality and consistency. |
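Phase 3 is the most automatable part of the pipeline. Here is a minimal sketch of what LLM-assisted trajectory verification could look like; the record and judge types are hypothetical illustrations, not DigiData's actual schema:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trajectory:
    goal: str                 # natural-language goal given to the annotator
    final_screenshot: bytes   # last screen captured after the demonstration
    actions: list[dict]       # recorded taps, swipes, and text entries

# A "judge" inspects the goal and the final screen and answers whether the
# goal was actually accomplished. In DigiData this role is shared between
# LLM judges and human reviewers; here it is just a callable.
Judge = Callable[[str, bytes], bool]

def verify(trajectories: list[Trajectory], judge: Judge) -> list[Trajectory]:
    # Keep only demonstrations the judge accepts; per the table above,
    # roughly 5% of collected samples are rejected at this stage.
    return [t for t in trajectories if judge(t.goal, t.final_screenshot)]
```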
Beyond screenshots, DigiData logs UI trees and LLM-generated Chain-of-Thought (CoT) data at every step. These “explanatory traces” translate visual actions into human-readable reasoning—why the agent taps where it does, what it expects to happen next. In essence, the dataset teaches models how to explain themselves.
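To make that per-step logging concrete, here is a rough sketch of what one step record might contain; the field names are illustrative, not the dataset's real format:

```python
from dataclasses import dataclass

@dataclass
class Step:
    screenshot: bytes      # raw pixels the agent observes
    ui_tree: dict          # accessibility / view-hierarchy dump for the same screen
    chain_of_thought: str  # LLM-generated rationale: why act here, what should happen next
    action: dict           # e.g. {"type": "tap", "x": 540, "y": 1210}
                           # or  {"type": "type_text", "text": "refundable"}

@dataclass
class Episode:
    goal: str              # e.g. "sort flight options by cheapest refundable tickets"
    app: str               # one of the 26 apps covered by the dataset
    steps: list[Step]      # ~9.2 steps per goal on average
```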
Findings — Benchmarks that finally think
The companion benchmark, DigiData-Bench, is where the story gets empirical. It includes 309 goals across 37 apps, split into three novelty classes:
| Category | Definition | Purpose |
|---|---|---|
| Seen | Apps present in DigiData | Tests memorization |
| Familiar | Apps in similar categories | Tests transfer learning |
| Novel | Entirely unseen app types | Tests generalization |
Agents trained on DigiData were evaluated via both human-assisted and AI-assisted dynamic protocols—essentially, letting an agent perform live on devices while human or LLM judges score success. Static metrics like step accuracy (the fraction of predicted actions that match the ground-truth demonstration) proved misleading: agents with similar step-level precision diverged sharply in end-to-end success.
| Model | Step Accuracy (%) | DigiData-Bench Success Rate (%) |
|---|---|---|
| GPT‑4o | 40.0 | 27.8 |
| Qwen2.5‑VL | 49.2 | 39.2 |
| DigiData 8B (CoT) | 72.8 | 47.3 (human) / 53.6 (LLM) |
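The gap between those two columns is easy to express in code. A minimal sketch (my own formulation, not the benchmark's implementation) of why per-step imitation accuracy and end-to-end success diverge:

```python
def step_accuracy(predicted: list[dict], reference: list[dict]) -> float:
    # Fraction of predicted actions that match the human demonstration
    # step for step. A high score only means the agent imitates well.
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / max(len(reference), 1)

def success_rate(episode_outcomes: list[bool]) -> float:
    # Fraction of live episodes that a judge (human or LLM) marks as having
    # achieved the goal. A single wrong tap early in an episode can sink an
    # otherwise "accurate" run, which is why the two metrics diverge.
    return sum(episode_outcomes) / max(len(episode_outcomes), 1)
```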
Even the best model still falls far below human experts (≈90% success), but that’s not the point. DigiData’s layered evaluation shows where and why models fail: generalization collapses in novel apps, and reasoning consistency lags behind perception.
Implications — From mobile control to digital autonomy
DigiData’s lesson is philosophical as much as technical: interaction data needs structure, not scale. It demonstrates that true “digital dexterity” arises from goal comprehension, not gesture replication.
For enterprises, this opens a credible path to UI‑level automation. Instead of brittle screen recorders, we can imagine agents that adapt across app updates, reason about workflows, and perform compound operations (“book travel, file receipt, update calendar”) autonomously. In other words: RPA meets cognition.
At a research level, DigiData suggests three inflection points:
- Human‑AI hybrid supervision — pairing annotators with LLM judges to clean and validate massive datasets.
- Chain‑of‑Thought as policy trace — using CoT not just for language, but for sequential visual reasoning (a sketch follows this list).
- Dynamic evaluation protocols — benchmarking agents by outcomes, not just by imitation.
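The second point is the most concrete to illustrate. A sketch of what "CoT as policy trace" could look like at inference time, assuming a hypothetical multimodal model wrapper (the `vlm.generate` call and its parameters are placeholders, not a real API):

```python
import json

def agent_step(vlm, goal: str, screenshot: bytes, ui_tree: dict) -> tuple[str, dict]:
    """One step of a CoT-conditioned policy: the model reasons out loud
    before committing to a single UI action."""
    prompt = (
        f"Goal: {goal}\n"
        f"UI tree (truncated): {json.dumps(ui_tree)[:2000]}\n"
        "First explain, in one or two sentences, what the screen shows and what "
        "the next action should accomplish. Then, on the final line, output the "
        'action as JSON, e.g. {"type": "tap", "x": 540, "y": 1210}.'
    )
    # `vlm.generate` is a placeholder for whatever multimodal API is in use.
    response = vlm.generate(prompt=prompt, images=[screenshot])
    lines = response.strip().splitlines()
    reasoning = "\n".join(lines[:-1])   # the explanatory trace, kept for logging
    action = json.loads(lines[-1])      # the structured action actually executed
    return reasoning, action
```

Training on DigiData's CoT traces amounts to supervising both outputs of a function like this, the reasoning and the action, rather than the action alone.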
Conclusion — Toward agents that can touch, see, and think
Meta’s DigiData is less a dataset than a training ground for embodied intelligence. It treats every tap and swipe as part of a reasoning chain—a cognitive fingerprint of how humans interact with digital systems. The next leap will come when these agents stop replaying human behavior and start abstracting from it—learning not only what we do, but why.
Cognaptus: Automate the Present, Incubate the Future.