Opening — Why this matters now
For years, autonomous web agents have promised to automate the internet: booking flights, scraping dashboards, configuring enterprise tools, or simply clicking buttons so humans don’t have to. And yet, anyone who has actually tried to deploy one knows the truth—these agents fail in embarrassingly human ways. They get lost. They click the wrong thing. They forget what they were doing halfway through.
The problem is not raw intelligence. Modern multimodal models can reason, plan, and even explain their mistakes. The real bottleneck is reliability in messy, dynamic, real-world interfaces.
The paper Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts addresses this problem head-on. Instead of chasing bigger models or deeper prompts, it asks a more uncomfortable question: what if web agents fail because they don’t behave like humans at all?
Background — Why web agents keep breaking
Most web agents today fall into one of two camps:
- DOM-centric agents that reason over HTML trees and accessibility metadata.
- Vision-centric agents that treat the page as an image and click by coordinates.
Both approaches work—until they don’t. Modern websites are hostile environments: nested iframes, canvas-rendered UI, shadow DOMs, modals that appear mid-task, and stateful widgets that change behavior based on invisible context.
The authors identify three structural failure modes that repeatedly surface in benchmarks like ONLINE-MIND2WEB:
| Bottleneck | Why it happens | Typical failure |
|---|---|---|
| Element grounding | Single-modality perception | Clicking dead zones, missing iframe elements |
| Procedural knowledge | No site-specific experience | Trial-and-error loops, wasted steps |
| Long-horizon memory | Stateless or shallow memory | Losing subgoals, navigation drift |
In short: agents can see, but not recognize; they can act, but not anticipate; they can remember, but not summarize.
Analysis — What Avenir-Web actually changes
Avenir-Web is not a single model. It is an architectural argument disguised as a system.
The framework introduces four tightly coupled components that together imitate how humans operate on unfamiliar websites.
1. Experience-Imitation Planning (EIP)
Humans do not explore websites blindly. We Google first.
Avenir-Web formalizes this intuition. Before touching the interface, the agent searches for human-authored guides—help pages, forum posts, tutorials—and distills them into a short, site-specific roadmap. This happens once, during initialization.
The key design choice: EIP produces procedural intent, not selectors. It says “filters are in the footer”, not “click div#x”. This decouples strategy from layout and drastically reduces exploration cost.
Without EIP, the agent behaves like a user dropped onto a site with amnesia. With it, the agent behaves like a user who skimmed the manual.
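The EIP flow described above can be sketched in a few lines. Everything here is illustrative: `search` and `distill` are stand-ins for a web-search call and an LLM summarization call, and the `Roadmap` type is our own naming, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class Roadmap:
    """Site-specific procedural intent distilled from human-authored guides."""
    site: str
    steps: list[str]  # procedural hints ("filters are in the footer"), not selectors

def build_roadmap(site: str, task: str, search, distill) -> Roadmap:
    """Run once at initialization: search for guides, distill them into intent.

    `search` and `distill` are hypothetical callbacks standing in for a
    search engine query and an LLM summarization step.
    """
    guides = search(f"how to {task} on {site}")  # help pages, forum posts, tutorials
    steps = distill(guides)                      # short, layout-independent roadmap
    return Roadmap(site=site, steps=steps)

# Toy stand-ins so the sketch runs end-to-end.
fake_search = lambda q: [f"guide found for: {q}"]
fake_distill = lambda docs: [
    "open the filters panel in the footer",
    "apply the date filter before sorting",
]

roadmap = build_roadmap("example-shop.com", "filter search results",
                        fake_search, fake_distill)
print(roadmap.steps)
```

The key point the sketch preserves: the roadmap stores intent, not selectors, so it survives layout changes that would break a `div#x`-style plan.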
2. Task-Tracking Checklist
Long tasks fail not because of bad steps, but because of forgotten goals.
Avenir-Web decomposes each instruction into 2–6 atomic, observable outcome states—a checklist that persists across pages and transitions. Every action updates exactly one item.
This transforms execution from a fuzzy loop into a verifiable state machine:
- Pending → In progress → Completed → Failed
The checklist acts as an externalized conscience. When progress stalls, the agent knows what is missing, not just that something went wrong.
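The checklist-as-state-machine idea is concrete enough to sketch. This is a minimal reimplementation under our own naming (`Checklist`, `Status`), not the paper's code; the invariants it encodes are the ones stated above: items persist across pages, each action updates exactly one item, and stalled progress maps to specific missing items.

```python
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"

class Checklist:
    """A persistent checklist of atomic, observable outcome states."""

    def __init__(self, items: list[str]):
        # 2-6 atomic subgoals decomposed from the instruction
        self.items = {name: Status.PENDING for name in items}

    def update(self, name: str, status: Status) -> None:
        """Each agent action updates exactly one item."""
        self.items[name] = status

    def missing(self) -> list[str]:
        """What is still unfinished -- the agent's answer when progress stalls."""
        return [n for n, s in self.items.items() if s is not Status.COMPLETED]

cl = Checklist(["search flight", "select dates", "confirm booking"])
cl.update("search flight", Status.COMPLETED)
cl.update("select dates", Status.IN_PROGRESS)
print(cl.missing())  # the agent knows *what* is missing, not just that it stalled
```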
3. Mixture of Grounding Experts (MoGE)
This is the system’s most consequential design decision.
Instead of committing to DOM-first or vision-first grounding, Avenir-Web treats grounding as a routing problem:
- Default: direct visual grounding on the rendered page
- Fallback: semantic structural reasoning for precision tasks
The agent clicks like a human—on what it sees—not what the DOM claims exists. Structural reasoning is invoked only when visual grounding fails or ambiguity arises.
This inversion matters. It allows Avenir-Web to operate seamlessly inside iframes and canvas-heavy layouts that routinely paralyze DOM-centric agents.
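The routing logic itself is simple, and that simplicity is the point. A minimal sketch, with `visual_ground` and `structural_ground` as hypothetical expert callbacks (the real experts are multimodal models; only the visual-first routing is what the paper commits to):

```python
def ground(target: str, screenshot, dom, visual_ground, structural_ground):
    """Route a grounding request: visual expert first, structural fallback.

    `visual_ground` returns (x, y) click coordinates, or None when the
    target is ambiguous or not visible on the rendered page.
    `structural_ground` resolves the target against DOM/accessibility
    metadata. Both callbacks are illustrative stand-ins.
    """
    coords = visual_ground(target, screenshot)
    if coords is not None:
        return ("visual", coords)  # click what the agent sees, like a human
    # Fallback: semantic structural reasoning for precision cases
    return ("structural", structural_ground(target, dom))

# Toy experts: visual grounding fails for an element buried in an iframe.
vis = lambda t, img: (120, 340) if t == "Submit" else None
struct = lambda t, dom: f"//iframe//button[text()='{t}']"

print(ground("Submit", None, None, vis, struct))          # visual path
print(ground("Accept cookies", None, None, vis, struct))  # structural fallback
```

Note the asymmetry: the DOM is consulted only on failure, which is the inversion of most DOM-centric agents.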
4. Adaptive Memory
Memory is not storage; it is compression.
Avenir-Web maintains a short sliding window of recent actions, while recursively summarizing older interactions into a distilled strategic state. Crucially, failures are preserved immediately through explicit reflection.
This avoids two classic extremes:
- Full-history context overflow → hallucinations
- Fixed window truncation → amnesia
The result is something rare in agents: situational awareness.
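The window-plus-summary mechanism can be sketched directly. The window size and the `summarize` callback (a stand-in for an LLM distillation call) are assumptions; what the sketch preserves is the stated behavior: recent actions stay verbatim, older ones are folded into a summary, and failures are reflected into the summary immediately rather than waiting to age out of the window.

```python
WINDOW = 4  # recent actions kept verbatim (size is an assumption, not from the paper)

class AdaptiveMemory:
    """Sliding window of recent actions plus a recursive summary of older ones."""

    def __init__(self, summarize):
        self.recent: list[str] = []
        self.summary = ""
        self.summarize = summarize  # hypothetical LLM distillation callback

    def record(self, action: str, failed: bool = False) -> None:
        self.recent.append(action)
        if failed:
            # Failures are preserved immediately through explicit reflection,
            # not left to fall out of the window.
            self.summary = self.summarize(self.summary, f"FAILED: {action}")
        if len(self.recent) > WINDOW:
            oldest = self.recent.pop(0)
            self.summary = self.summarize(self.summary, oldest)

    def context(self) -> str:
        """Distilled strategic state + verbatim recent actions."""
        return self.summary + " | " + "; ".join(self.recent)

# Toy summarizer that concatenates; a real system would compress.
mem = AdaptiveMemory(lambda s, x: (s + " " + x).strip())
for i in range(6):
    mem.record(f"step-{i}", failed=(i == 2))
print(mem.context())
```

The two failure modes listed above map directly to the two branches: without the summary, old context is truncated (amnesia); without the window cap, context grows unboundedly (overflow).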
Findings — Does it actually work?
Yes—and the numbers are hard to ignore.
On the ONLINE-MIND2WEB benchmark (300 live tasks across 136 websites), Avenir-Web achieves a 53.7% task success rate, establishing a new open-source state of the art.
| Agent | Open Source | Success Rate |
|---|---|---|
| SeeAct | ✓ | 30.0% |
| Browser-Use | ✓ | 26.0% |
| Agent-E | ✓ | 27.0% |
| Avenir-Web (Gemini 3 Pro) | ✓ | 53.7% |
More interestingly, a fully open-source configuration using an 8B model reaches 25.7%, on par with older baselines that rely on GPT-4-class backbones.
Ablation studies reveal where the real leverage lies:
| Configuration | Success Rate |
|---|---|
| Full system | 48.0% |
| – Experience-Imitation Planning | 36.0% |
| – Adaptive Memory | 36–42% |
| – MoGE | 40.0% |
Experience and memory matter more than raw perception.
Implications — What this means beyond benchmarks
Avenir-Web quietly reframes the web agent problem.
The frontier is no longer “Can a model click buttons?” but “Can an agent accumulate, retain, and apply experience?”
For businesses, this has three immediate implications:
- Smaller models can compete when wrapped in the right agent architecture.
- Procedural knowledge beats brute-force exploration, reducing cost and latency.
- Agent reliability is an engineering problem, not a scaling law.
This also explains why proprietary agents feel mysteriously more robust: they are not just bigger—they are more experienced.
Conclusion — The future is boring (and that’s good)
Avenir-Web does not promise AGI. It promises something more valuable: web agents that stop embarrassing themselves.
By imitating how humans prepare, perceive, remember, and recover from failure, the system closes much of the gap between research demos and production automation. The breakthrough is not a model—it is a mindset.
Reliable agents will not emerge from larger transformers alone. They will emerge from architectures that respect how messy the real world actually is.
Cognaptus: Automate the Present, Incubate the Future.