Opening — Why this matters now
For years, autonomous web agents have promised to automate the internet: booking flights, scraping dashboards, configuring enterprise tools, or simply clicking buttons so humans don’t have to. And yet, anyone who has actually tried to deploy one knows the truth—these agents fail in embarrassingly human ways. They get lost. They click the wrong thing. They forget what they were doing halfway through.
The problem is not raw intelligence. Modern multimodal models can reason, plan, and even explain their mistakes. The real bottleneck is reliability in messy, dynamic, real-world interfaces.
The paper Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts addresses this problem head-on. Instead of chasing bigger models or deeper prompts, it asks a more uncomfortable question: what if web agents fail because they don’t behave like humans at all?
Background — Why web agents keep breaking
Most web agents today fall into one of two camps:
- DOM-centric agents that reason over HTML trees and accessibility metadata.
- Vision-centric agents that treat the page as an image and click by coordinates.
Both approaches work—until they don’t. Modern websites are hostile environments: nested iframes, canvas-rendered UI, shadow DOMs, modals that appear mid-task, and stateful widgets that change behavior based on invisible context.
The authors identify three structural failure modes that repeatedly surface in benchmarks like ONLINE-MIND2WEB:
| Bottleneck | Why it happens | Typical failure |
|---|---|---|
| Element grounding | Single-modality perception | Clicking dead zones, missing iframe elements |
| Procedural knowledge | No site-specific experience | Trial-and-error loops, wasted steps |
| Long-horizon memory | Stateless or shallow memory | Losing subgoals, navigation drift |
In short: agents can see, but not recognize; they can act, but not anticipate; they can remember, but not summarize.
Analysis — What Avenir-Web actually changes
Avenir-Web is not a single model. It is an architectural argument disguised as a system.
The framework introduces four tightly coupled components that together imitate how humans operate on unfamiliar websites.
1. Experience-Imitation Planning (EIP)
Humans do not explore websites blindly. We Google first.
Avenir-Web formalizes this intuition. Before touching the interface, the agent searches for human-authored guides—help pages, forum posts, tutorials—and distills them into a short, site-specific roadmap. This happens once, during initialization.
The key design choice: EIP produces procedural intent, not selectors. It says “filters are in the footer”, not “click div#x”. This decouples strategy from layout and drastically reduces exploration cost.
Without EIP, the agent behaves like a user dropped onto a site with amnesia. With it, the agent behaves like a user who skimmed the manual.
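The EIP flow described above can be sketched in a few lines. Everything here is illustrative: `search` and `distill` are stand-ins for a web-search call and an LLM summarization call, and the `Roadmap` type is our own naming, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class Roadmap:
    """Site-specific procedural intent distilled from human-authored guides."""
    site: str
    steps: list[str]  # procedural hints ("filters are in the footer"), not selectors

def build_roadmap(site: str, task: str, search, distill) -> Roadmap:
    """Run once at initialization: search for guides, distill them into intent.

    `search` and `distill` are hypothetical callbacks standing in for a
    search engine query and an LLM summarization step.
    """
    guides = search(f"how to {task} on {site}")  # help pages, forum posts, tutorials
    steps = distill(guides)                      # short, layout-independent roadmap
    return Roadmap(site=site, steps=steps)

# Toy stand-ins so the sketch runs end-to-end.
fake_search = lambda q: [f"guide found for: {q}"]
fake_distill = lambda docs: [
    "open the filters panel in the footer",
    "apply the date filter before sorting",
]

roadmap = build_roadmap("example-shop.com", "filter search results",
                        fake_search, fake_distill)
print(roadmap.steps)
```

The key point the sketch preserves: the roadmap stores intent, not selectors, so it survives layout changes that would break a `div#x`-style plan.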
2. Task-Tracking Checklist
Long tasks fail not because of bad steps, but because of forgotten goals.
Avenir-Web decomposes each instruction into 2–6 atomic, observable outcome states—a checklist that persists across pages and transitions. Every action updates exactly one item.
This transforms execution from a fuzzy loop into a verifiable state machine:
- Pending → In progress → Completed → Failed
The checklist acts as an externalized conscience. When progress stalls, the agent knows what is missing, not just that something went wrong.
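The checklist-as-state-machine idea is concrete enough to sketch. This is a minimal reimplementation under our own naming (`Checklist`, `Status`), not the paper's code; the invariants it encodes are the ones stated above: items persist across pages, each action updates exactly one item, and stalled progress maps to specific missing items.

```python
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"

class Checklist:
    """A persistent checklist of atomic, observable outcome states."""

    def __init__(self, items: list[str]):
        # 2-6 atomic subgoals decomposed from the instruction
        self.items = {name: Status.PENDING for name in items}

    def update(self, name: str, status: Status) -> None:
        """Each agent action updates exactly one item."""
        self.items[name] = status

    def missing(self) -> list[str]:
        """What is still unfinished -- the agent's answer when progress stalls."""
        return [n for n, s in self.items.items() if s is not Status.COMPLETED]

cl = Checklist(["search flight", "select dates", "confirm booking"])
cl.update("search flight", Status.COMPLETED)
cl.update("select dates", Status.IN_PROGRESS)
print(cl.missing())  # the agent knows *what* is missing, not just that it stalled
```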
3. Mixture of Grounding Experts (MoGE)
This is the system’s most consequential design decision.
Instead of committing to DOM-first or vision-first grounding, Avenir-Web treats grounding as a routing problem:
- Default: direct visual grounding on the rendered page
- Fallback: semantic structural reasoning for precision tasks
The agent clicks like a human—on what it sees—not what the DOM claims exists. Structural reasoning is invoked only when visual grounding fails or ambiguity arises.
This inversion matters. It allows Avenir-Web to operate seamlessly inside iframes and canvas-heavy layouts that routinely paralyze DOM-centric agents.
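The routing logic itself is simple, and that simplicity is the point. A minimal sketch, with `visual_ground` and `structural_ground` as hypothetical expert callbacks (the real experts are multimodal models; only the visual-first routing is what the paper commits to):

```python
def ground(target: str, screenshot, dom, visual_ground, structural_ground):
    """Route a grounding request: visual expert first, structural fallback.

    `visual_ground` returns (x, y) click coordinates, or None when the
    target is ambiguous or not visible on the rendered page.
    `structural_ground` resolves the target against DOM/accessibility
    metadata. Both callbacks are illustrative stand-ins.
    """
    coords = visual_ground(target, screenshot)
    if coords is not None:
        return ("visual", coords)  # click what the agent sees, like a human
    # Fallback: semantic structural reasoning for precision cases
    return ("structural", structural_ground(target, dom))

# Toy experts: visual grounding fails for an element buried in an iframe.
vis = lambda t, img: (120, 340) if t == "Submit" else None
struct = lambda t, dom: f"//iframe//button[text()='{t}']"

print(ground("Submit", None, None, vis, struct))          # visual path
print(ground("Accept cookies", None, None, vis, struct))  # structural fallback
```

Note the asymmetry: the DOM is consulted only on failure, which is the inversion of most DOM-centric agents.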
4. Adaptive Memory
Memory is not storage; it is compression.
Avenir-Web maintains a short sliding window of recent actions, while recursively summarizing older interactions into a distilled strategic state. Crucially, failures are preserved immediately through explicit reflection.
This avoids two classic extremes:
- Full-history context overflow → hallucinations
- Fixed window truncation → amnesia
The result is something rare in agents: situational awareness.
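The window-plus-summary mechanism can be sketched directly. The window size and the `summarize` callback (a stand-in for an LLM distillation call) are assumptions; what the sketch preserves is the stated behavior: recent actions stay verbatim, older ones are folded into a summary, and failures are reflected into the summary immediately rather than waiting to age out of the window.

```python
WINDOW = 4  # recent actions kept verbatim (size is an assumption, not from the paper)

class AdaptiveMemory:
    """Sliding window of recent actions plus a recursive summary of older ones."""

    def __init__(self, summarize):
        self.recent: list[str] = []
        self.summary = ""
        self.summarize = summarize  # hypothetical LLM distillation callback

    def record(self, action: str, failed: bool = False) -> None:
        self.recent.append(action)
        if failed:
            # Failures are preserved immediately through explicit reflection,
            # not left to fall out of the window.
            self.summary = self.summarize(self.summary, f"FAILED: {action}")
        if len(self.recent) > WINDOW:
            oldest = self.recent.pop(0)
            self.summary = self.summarize(self.summary, oldest)

    def context(self) -> str:
        """Distilled strategic state + verbatim recent actions."""
        return self.summary + " | " + "; ".join(self.recent)

# Toy summarizer that concatenates; a real system would compress.
mem = AdaptiveMemory(lambda s, x: (s + " " + x).strip())
for i in range(6):
    mem.record(f"step-{i}", failed=(i == 2))
print(mem.context())
```

The two failure modes listed above map directly to the two branches: without the summary, old context is truncated (amnesia); without the window cap, context grows unboundedly (overflow).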
Findings — Does it actually work?
Yes—and the numbers are hard to ignore.
On the ONLINE-MIND2WEB benchmark (300 live tasks across 136 websites), Avenir-Web achieves a 53.7% task success rate, establishing a new open-source state of the art.
| Agent | Open Source | Success Rate |
|---|---|---|
| SeeAct | ✓ | 30.0% |
| Browser-Use | ✓ | 26.0% |
| Agent-E | ✓ | 27.0% |
| Avenir-Web (Gemini 3 Pro) | ✓ | 53.7% |
More interestingly, a fully open-source configuration using an 8B model reaches 25.7%, on par with older baselines that rely on GPT-4-class backbones.
Ablation studies reveal where the real leverage lies:
| Configuration | Success Rate |
|---|---|
| Full system | 48.0% |
| – Experience-Imitation Planning | 36.0% |
| – Adaptive Memory | 36–42% |
| – MoGE | 40.0% |
Experience and memory matter more than raw perception.
Implications — What this means beyond benchmarks
Avenir-Web quietly reframes the web agent problem.
The frontier is no longer “Can a model click buttons?” but “Can an agent accumulate, retain, and apply experience?”
For businesses, this has three immediate implications:
- Smaller models can compete when wrapped in the right agent architecture.
- Procedural knowledge beats brute-force exploration, reducing cost and latency.
- Agent reliability is an engineering problem, not a scaling law.
This also explains why proprietary agents feel mysteriously more robust: they are not just bigger—they are more experienced.
Conclusion — The future is boring (and that’s good)
Avenir-Web does not promise AGI. It promises something more valuable: web agents that stop embarrassing themselves.
By imitating how humans prepare, perceive, remember, and recover from failure, the system closes much of the gap between research demos and production automation. The breakthrough is not a model—it is a mindset.
Reliable agents will not emerge from larger transformers alone. They will emerge from architectures that respect how messy the real world actually is.
Cognaptus: Automate the Present, Incubate the Future.