Opening — Why This Matters Now

Mobile AI agents are impressive—until you notice they mostly wait.

Today’s multimodal large language models (MLLMs) can read screens, parse instructions, and execute multi-step workflows. But they operate inside a narrow contract: tell me what to do, and I will do it.

The real frontier is different. It is not faster execution. It is anticipation.

The paper “ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices” reframes the problem: what if mobile agents inferred intent before users articulated it—and acted safely, correctly, and executably?

That shift is subtle in phrasing. It is massive in systems implications.

And according to the authors’ experiments, even the strongest frontier models struggle.


Background — The Reactive Comfort Zone

Most current mobile agents fall into what we can call the reactive paradigm:

  1. User provides explicit command
  2. Model parses instruction
  3. Agent executes function

The cognitive burden remains entirely human.

Proactive intelligence changes the equation. Instead of waiting for instruction, the agent must:

  • Infer latent user intent
  • Resolve ambiguity
  • Avoid false triggers
  • Map intention into executable actions
  • Decide when not to act

This is not just UX polish. It introduces an entirely new risk profile:

  • Incorrect actions become costly
  • Over-triggering damages trust
  • Under-triggering negates value

Previous benchmarks simplified this problem by assuming:

  • A single “correct” action per scenario
  • Natural language recommendations only
  • Text similarity as evaluation proxy

In real life, user intent is rarely one-to-one.

The ProactiveMobile benchmark rejects that simplification.


Analysis — Formalizing Proactivity as a Structured Task

1. Four-Dimensional Context Modeling

ProactiveMobile defines intent inference as a function of four contextual signals:

| Dimension | Description | Risk if Ignored |
| --- | --- | --- |
| User Profile | Long-term habits & preferences | Generic suggestions |
| Device Status | Battery, location, connectivity | Mis-timed actions |
| World Information | Weather, holidays, time | Context blindness |
| Behavioral Trajectories | Sequential user-device interactions | Intent misreading |

Formally:

$$ T = \text{Predict}(U, D, W, B) $$

where $U$, $D$, $W$, and $B$ denote the four signals above and $T$ is the inferred proactive task. This forces agents to reason across temporal, environmental, and behavioral layers.
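To make the formula concrete, here is a toy rule-based stand-in for $\text{Predict}$. The `Context` fields mirror the four dimensions, but the field names and intent labels are illustrative inventions, not the paper's schema; the benchmark expects an MLLM, not rules, to do this reasoning.

```python
from dataclasses import dataclass

@dataclass
class Context:
    user_profile: dict   # U: long-term habits & preferences
    device_status: dict  # D: battery, location, connectivity
    world_info: dict     # W: weather, holidays, time
    trajectory: list     # B: sequential user-device interactions

def predict_intent(ctx: Context) -> str:
    """Toy stand-in for T = Predict(U, D, W, B)."""
    if ctx.device_status.get("battery", 100) < 20:
        return "enable_low_power_mode"
    if "rain" in ctx.world_info.get("weather", "") and "exit_home" in ctx.trajectory:
        return "remind_umbrella"
    return "no_action"  # deciding *not* to act is also a valid output
```

Note that `"no_action"` is a first-class output: restraint is part of the task definition, not a failure mode.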

2. One-to-Many Ground Truth

Each scenario includes 1–3 valid proactive actions.

This acknowledges a structural reality:

Good proactive assistance is subjective.

Evaluation therefore becomes a best-match selection problem, not exact string comparison.
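In sketch form, best-match selection is just "does the prediction match any annotated valid answer" (the real benchmark's matching rules are richer than exact equality):

```python
def best_match_success(predicted_seq, valid_action_seqs):
    """One-to-many evaluation: a prediction succeeds if it matches
    ANY of the 1-3 annotated valid action sequences, rather than
    one canonical ground-truth string."""
    return any(predicted_seq == gold for gold in valid_action_seqs)
```

This is why a scenario can have both "set an alarm for 6:45" and "set an alarm for 7:00" as acceptable proactive actions.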

3. Executable Function Sequences

This is the most important design decision.

Instead of producing textual advice, models must output structured API sequences drawn from a 63-function pool.

That transforms evaluation from semantic similarity into:

  • Functional equivalence
  • Executability
  • Parameter correctness

In other words, it bridges suggestion and action.
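A minimal sketch of what execution-centric scoring implies, assuming calls are represented as `(function_name, params_dict)` pairs; the benchmark's actual formats and matching rules may differ:

```python
def check_sequence(pred, gold, function_pool):
    """Score a predicted function sequence against a gold sequence."""
    # Executability: every predicted call must exist in the fixed pool
    # (the paper's pool has 63 functions).
    if any(name not in function_pool for name, _ in pred):
        return False
    # Functional equivalence + parameter correctness: exact match here;
    # a fuller checker could allow equivalent orderings or aliases.
    return pred == gold
```

A hallucinated function name fails the executability check before parameters are even compared.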

4. Noise Injection for Robustness

The dataset injects irrelevant but coherent context—5–20× the volume of task-relevant information.

This tests signal extraction under distraction.

Which, frankly, mirrors real smartphone usage.
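The injection mechanism can be sketched as mixing relevant context items with a multiple of coherent distractors; the sampling details here are my own, only the 5-20x ratio comes from the paper:

```python
import random

def inject_noise(relevant, distractor_pool, ratio=5, seed=0):
    """Mix task-relevant context items with `ratio`x distractors
    and shuffle, so the model must extract signal under distraction."""
    rng = random.Random(seed)
    k = min(len(distractor_pool), ratio * len(relevant))
    mixed = relevant + rng.sample(distractor_pool, k)
    rng.shuffle(mixed)
    return mixed
```

At a 20x ratio, the relevant signal is under 5% of the context window, which is a fair approximation of a real notification shade.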


Dataset Scope and Design

The benchmark contains:

| Split | Scenes | Items | Intents | Modality |
| --- | --- | --- | --- | --- |
| Train | 12 | 4,438 | 8,977 | Text + Multimodal |
| Test | 14 | 1,832 | 3,711 | Text + Multimodal |

Two test scenarios are out-of-distribution (OOD), enabling generalization testing.

A three-tier difficulty system (L1–L3) is defined by how many frontier models solve each instance.

This is an elegant idea: difficulty emerges from empirical model consensus.
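Consensus-based tiering reduces to counting which frontier models solve each instance; the cutoffs below are illustrative, since the paper defines its own thresholds:

```python
def difficulty_tier(n_solved, n_models):
    """Assign L1-L3 from frontier-model consensus."""
    frac = n_solved / n_models
    if frac >= 2 / 3:
        return "L1"  # most models solve it -> easy
    if frac > 0:
        return "L2"  # some models solve it -> medium
    return "L3"      # no model solves it -> hard
```

The appeal is that difficulty needs no human judgment call: it is recomputable whenever the model panel changes.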


Findings — The Reality Check

Overall Success Rate (SR)

| Model | Avg SR (%) |
| --- | --- |
| GPT-5 | 7.39 |
| GPT-4o | 6.80 |
| Gemini-2.5-Pro | 8.91 |
| o1 | 15.71 |
| Qwen2.5-VL-7B + Proactive | 19.15 |

Two insights stand out:

  1. Proactivity is not emergent in general-purpose models.
  2. It is learnable with targeted fine-tuning.

Even so, 19.15% success is hardly production-ready.

Multimodal Bottleneck

For the top-performing model:

| Modality | Success Rate |
| --- | --- |
| Text | 24.29% |
| Multimodal | 14.03% |

Grounding proactive inference in noisy GUI screenshots remains substantially harder.

This confirms that multimodal reasoning—not language—remains the dominant bottleneck.

Safety Trade-Off (False Trigger Rate)

The study compares output strategies:

| Strategy | SR | FTR |
| --- | --- | --- |
| Function Only | 8.44% | 100% |
| Think + Function | 5.85% | 99.89% |
| Rec + Func (Primary) | 19.15% | 14.77% |
| Think + Rec + Func | 7.38% | 2.21% |

Generating a textual recommendation before function execution dramatically reduces unsafe triggering.

In other words:

Forcing the model to articulate intent improves restraint.

That is a governance insight as much as a modeling insight.
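For readers implementing their own evaluation, SR and FTR reduce to two conditional rates; the field names below are assumptions, not the paper's harness:

```python
def success_rate(results):
    """SR: among scenarios where the agent SHOULD act,
    the fraction where its function sequence was correct."""
    actionable = [r for r in results if r["should_act"]]
    return sum(r["correct"] for r in actionable) / len(actionable)

def false_trigger_rate(results):
    """FTR: among scenarios where the agent should stay silent,
    the fraction where it fired an action anyway."""
    silent = [r for r in results if not r["should_act"]]
    return sum(r["acted"] for r in silent) / len(silent)
```

Because the two rates are computed over disjoint scenario sets, a model can trade one against the other, which is exactly the tension the ablation exposes.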


Implications — Beyond Benchmarking

1. Proactivity Is a Specialized Capability

Scale alone does not solve it.

o1 performs well in OOD scenarios (18.75%), likely leveraging broader pretraining. But fine-tuned models close the gap significantly.

This suggests proactive logic is a structured skill—not just a byproduct of parameter count.

2. Execution-Centric Evaluation Is the Future

Natural language evaluation is insufficient for agent systems.

If your AI cannot map intent to execution, you are benchmarking rhetoric—not capability.

3. Safety Requires Structured Reasoning

The ablation study reveals a subtle design tension:

  • Higher SR often increases false triggers
  • Adding reasoning reduces risk but lowers precision

For businesses deploying proactive agents, this becomes a product decision:

Optimize for initiative or optimize for restraint?

The optimal balance depends on domain risk.

4. Benchmark as Infrastructure

The benchmark cost $210,000 and four months of expert auditing.

That signals something important:

Serious agent evaluation requires infrastructure, not just prompt engineering.


Conclusion — The Hard Part Has Just Begun

ProactiveMobile does not demonstrate that proactive intelligence is solved.

It demonstrates that it is measurable.

That distinction matters.

Reactive agents are assistants. Proactive agents are collaborators.

But collaboration requires judgment, timing, and restraint.

With a 19% ceiling, we are still early.

Which is precisely why this benchmark is valuable.

It exposes the gap between impressive demos and deployable intelligence.

And for operators building AI products—not research prototypes—that gap is the only number that matters.

Cognaptus: Automate the Present, Incubate the Future.