Click.
That is where most web-agent demos become either impressive or mildly tragic. The model reads the instruction, understands the goal, produces a confident plan, and then clicks the wrong thing. Or it clicks the right thing before a modal appears. Or it scrolls, forgets why it scrolled, repeats an action, and quietly turns a three-step workflow into interpretive dance.
This is why Avenir-Web is more interesting than its benchmark number alone suggests. The paper, “Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts,” reports a 53.7% task success rate on Online-Mind2Web, a benchmark of 300 live web tasks across 136 websites [1]. That result matters. But the more useful lesson is architectural: Avenir-Web treats web automation not as a single reasoning problem, but as a reliability stack.
That distinction is important for businesses. If a web agent fails, the cause is rarely “the model is not smart enough” in the abstract. The failure is usually more specific. It did not know the site’s workflow. It could not ground the target element. It lost track of subgoals. It remembered too much low-level noise or too little useful history. In other words, it failed the way badly managed humans fail too—except faster, and with fewer apologies.
Avenir-Web’s quiet contribution is that it tries to make agents behave less like prompt-driven tourists and more like experienced operators.
The web-agent bottleneck is a reliability stack, not a model-size contest
The tempting reading is simple: Avenir-Web uses strong multimodal models and gets a higher benchmark score. That reading is not wrong, but it is lazy. The paper’s stronger argument is that web-agent reliability depends on four interacting mechanisms:
| Reliability layer | Avenir-Web component | What it tries to prevent | Business analogue |
|---|---|---|---|
| Procedural prior | Experience-Imitation Planning | Trial-and-error exploration on unfamiliar sites | Read the operating manual before touching the dashboard |
| Grounding | Mixture of Grounding Experts | Clicking the wrong element or failing inside iframes and dynamic UI | Use the screen like a human, then fall back to structure when needed |
| Progress control | Task-Tracking Checklist | Losing subgoals across pages and state changes | Maintain a task checklist, not a vague intention |
| Memory management | Adaptive Memory | Context bloat, amnesia, and repeated failure loops | Keep a compressed incident log, not a raw transcript landfill |
This is a better lens than “new agent beats old agent,” because enterprise automation rarely fails at the headline level. It fails at the joints: the dropdown that is not really a dropdown, the search result hidden below a banner, the modal that appears at the worst possible moment, the calendar picker that requires a visible click but has a DOM structure apparently designed by a committee of haunted architects.
Avenir-Web’s design says: do not ask one mechanism to solve all of that.
Experience-Imitation Planning: the agent reads the manual before it clicks
Humans do not usually approach an unfamiliar website as blank-slate reinforcement learners. We search. We skim help pages. We infer that the “careers” link may be in the footer, that an availability tool may require selecting dates before party size, or that a filter must be applied before sorting.
Avenir-Web formalizes this behavior through Experience-Imitation Planning, or EIP. Before execution begins, the agent searches for site-specific online resources—help pages, documentation, forum posts, or user guides—and compresses them into a short procedural roadmap. The plan is intentionally high-level: two to four imperative directives rather than brittle selectors.
That design choice matters. A selector-level plan would quickly expire when the website changes. A procedural plan is more durable. “Use the footer navigation to find careers” survives more layout changes than `div:nth-child(4) > a`, which is less an automation strategy than a cry for future maintenance.
EIP is not magic imitation learning. It does not mean the agent has watched thousands of internal enterprise workflows. It means the agent can retrieve public procedural knowledge and inject it into the execution context before the first action. The benefit is not omniscience. The benefit is fewer blind exploratory steps.
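To make the “two to four imperative directives” constraint concrete, here is a minimal Python sketch of a roadmap container that refuses selector-level steps. The class name `ProceduralPlan`, the regex heuristic, the placeholder site, and the example directives are all illustrative assumptions, not the paper's implementation; the retrieval and summarization steps that would produce the directives are out of scope.

```python
import re
from dataclasses import dataclass
from typing import Tuple

# Hypothetical heuristic: text that smells like a CSS or XPath selector.
_SELECTOR_HINT = re.compile(r"nth-child|\s>\s|\[@")


@dataclass(frozen=True)
class ProceduralPlan:
    """A short, site-level roadmap injected before the first action."""

    site: str
    directives: Tuple[str, ...]

    def __post_init__(self) -> None:
        if not 2 <= len(self.directives) <= 4:
            raise ValueError("expected two to four directives")
        for directive in self.directives:
            if _SELECTOR_HINT.search(directive):
                raise ValueError(f"selector-level step rejected: {directive!r}")

    def as_prompt(self) -> str:
        """Render the plan as numbered lines for the execution context."""
        steps = "\n".join(f"{i}. {d}" for i, d in enumerate(self.directives, 1))
        return f"Site: {self.site}\nRoadmap:\n{steps}"


plan = ProceduralPlan(
    site="example.com",  # placeholder site, not from the paper
    directives=(
        "Use the footer navigation to find careers.",
        "Filter openings by location before sorting.",
    ),
)
```

Rejecting selectors at construction time is one way to keep the roadmap at the level where it survives a redesign; a real system would likely enforce this in the summarization prompt as well.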
For business automation, this is the most transferable idea in the paper. Many companies already have procedural knowledge scattered across SOPs, help-center pages, internal wikis, screenshots, and training documents. A web agent that can retrieve and summarize this knowledge before acting will usually be safer than one that starts clicking from raw perception alone.
The boundary is also clear. If the documentation is outdated or incomplete, EIP can mislead; if it sits behind authentication, there may be nothing to retrieve at all. The paper shows the value of procedural priors; it does not prove that any arbitrary knowledge source is reliable enough to drive automation without verification.
MoGE: visual-first clicking, structural recovery when the page fights back
The most visible part of web automation is grounding: deciding where to click, type, select, or scroll. Historically, web agents often leaned toward either DOM-centric methods or visual methods. DOM-centric agents parse HTML and accessibility trees. Visual agents look at screenshots and produce coordinates.
Both approaches are useful. Both are also insufficient.
Modern websites are not polite documents. They contain nested iframes, canvas-rendered widgets, shadow DOMs, overlays, dynamic modals, and stateful components whose behavior may not be obvious from raw HTML. A DOM-centric agent can know that an element exists and still fail to interact with what the user sees. A pure visual agent can see the button but miss the semantic structure needed for precise selection or text input.
Avenir-Web’s Mixture of Grounding Experts, or MoGE, treats grounding as a routing problem. Its default path is visual-first: the agent interacts with the rendered page as a unified visual canvas, closer to how humans use browsers. That helps with interfaces where structural parsing is misleading or incomplete. For edge cases—fine-grained selection, text input, dropdowns, failed clicks—the system falls back to semantic or structural reasoning.
The paper’s comparison is useful because it avoids a false choice. Avenir-Web does not say “ignore the DOM.” It says the DOM should not be the tyrant of the interaction loop. Visual grounding should handle ordinary interaction; structure should become a specialist fallback.
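The routing idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not MoGE itself: the `Action` type, the `Expert` callable signature, and the choice of which action kinds bypass the visual path (`STRUCTURAL_KINDS`) are all hypothetical. Each expert is assumed to attempt the action and report whether it took effect.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    kind: str    # e.g. "click", "type", "select"
    target: str  # natural-language description of the element


# A grounding expert tries the action and reports whether it took effect.
Expert = Callable[[Action], bool]

# Assumption: these kinds skip the visual path and go straight to structure.
STRUCTURAL_KINDS = frozenset({"type", "select"})


def ground(action: Action, visual: Expert, structural: Expert) -> str:
    """Return which expert handled the action, or "failed"."""
    if action.kind not in STRUCTURAL_KINDS and visual(action):
        return "visual"
    # Either a precision-sensitive action or a failed visual attempt.
    return "structural" if structural(action) else "failed"
```

The point of the shape is that the structural expert is only invoked when it is needed, so the ordinary click path stays visual.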
That is a practical lesson for business systems. Enterprise web automation often fails because teams overcommit to one interface representation. Browser scripts assume stable selectors. Computer-vision agents assume every visible element can be clicked reliably. A production-grade agent needs both: human-like visual operation plus structural recovery when the screen lies, shifts, or refuses to cooperate.
The appendix case on allrecipes.com illustrates the point. A SeeAct baseline stalls with repeated non-responsive actions, while Avenir-Web closes a blocking modal and reaches the reviews link. This is qualitative evidence, not a population-level proof. But it explains the mechanism behind the benchmark result: the agent is not just “smarter”; it is better equipped to recover when the page behaves like a page.
The checklist turns vague intention into observable progress
A long web task is not merely a sequence of clicks. It is a sequence of state changes. That is where many agents quietly deteriorate. They execute plausible actions, but the relationship between action and goal becomes fuzzy. After several pages, a modal, and two failed attempts, the model may still sound confident while having no reliable idea which constraints are already satisfied.
Avenir-Web addresses this with a Task-Tracking Checklist. At the start of a task, the system decomposes the instruction into two to six atomic, observable outcome states. Each item receives a status such as pending, in progress, completed, or failed. After each interaction step, a lightweight model updates exactly one checklist item based on the action and observed page state.
That “exactly one” rule is small but revealing. It forces progress tracking to remain local and auditable. Instead of letting the model produce a vague self-evaluation—“I am making progress”—the agent must attach the latest step to a specific requirement.
For example, in the paper’s petfinder.com illustration, the checklist separates constraints such as location, age category, and sort order. The agent can succeed at one item while still failing another. This matters because many business tasks are constraint bundles: find a customer record, apply a date filter, export a specific report, select a region, confirm a status. Partial success is not success. It is the place where silent errors like to hide.
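A minimal sketch of such a checklist, assuming nothing beyond what the description above states: two to six items, four statuses, and one item touched per step. The class and method names are hypothetical, the item wording is invented for illustration, and the lightweight model that would choose which item to update is left out.

```python
from enum import Enum
from typing import List


class Status(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in progress"
    COMPLETED = "completed"
    FAILED = "failed"


class Checklist:
    def __init__(self, items: List[str]) -> None:
        if not 2 <= len(items) <= 6:
            raise ValueError("expected two to six observable outcomes")
        self.items = list(items)
        self.status = [Status.PENDING] * len(items)

    def update(self, index: int, status: Status) -> None:
        """Record one step: exactly one item changes per call."""
        if not 0 <= index < len(self.items):
            raise IndexError("no such checklist item")
        self.status[index] = status

    def unfinished(self) -> List[str]:
        """Items that still block completion, in their original order."""
        return [
            item
            for item, status in zip(self.items, self.status)
            if status is not Status.COMPLETED
        ]

    def done(self) -> bool:
        return not self.unfinished()


# Illustrative items loosely modeled on the location / age / sort example.
checklist = Checklist(
    ["Location filter applied", "Age category selected", "Results sorted as requested"]
)
checklist.update(0, Status.COMPLETED)
checklist.update(2, Status.FAILED)
```

Here the task is not done even though one constraint succeeded, which is exactly the partial-success case the checklist is meant to expose.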
The checklist is therefore less glamorous than multimodal grounding, but more operationally important than it first appears. It converts a web agent from a reactive clicker into a monitored workflow executor.
Naturally, the checklist is only as good as its decomposition and update logic. If the original instruction is ambiguous, or if the page state cannot be reliably observed, the checklist may still misclassify progress. But even that failure is more inspectable than an agent simply drifting through a website with a pleasant chain-of-thought-shaped fog machine.
Adaptive Memory: storing everything is not remembering
Agent memory has a boring name and a difficult job. The naive solution is to keep the full interaction history in context. That sounds reasonable until the context becomes bloated with low-level actions, repeated observations, and stale page states. The opposite solution is a fixed sliding window. That avoids bloat, but it forgets why earlier actions mattered.
Avenir-Web’s Adaptive Memory chooses a middle path. It keeps a recent sliding window—default size five in the paper—and recursively summarizes older interaction chunks into a persistent memory state. It also performs failure reflection: when an action fails or produces unexpected feedback, the failure is summarized immediately and preserved.
This is not memory as storage. It is memory as compression.
The distinction matters. A web agent does not need to remember every pixel it has seen. It needs to remember strategic facts: which path was tried, what failed, what constraints remain, and what page state now matters. A raw transcript is useful for debugging, but it is a poor working memory for action.
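The mechanics can be sketched as a bounded window plus a running summary. This is a simplified illustration, not the paper's code: it folds one evicted step at a time rather than whole chunks, and the `summarize` callable stands in for what would be a model call. `AdaptiveMemory`, `join_summarizer`, and the `context` layout are hypothetical names; only the default window of five comes from the description above.

```python
from collections import deque
from typing import Callable, Deque, List

# (previous_summary, step) -> new_summary. In practice this would be a model call.
Summarizer = Callable[[str, str], str]


class AdaptiveMemory:
    def __init__(self, summarize: Summarizer, window: int = 5) -> None:
        self.summarize = summarize
        self.window = window
        self.recent: Deque[str] = deque()
        self.summary = ""
        self.failures: List[str] = []

    def add(self, step: str, failed: bool = False) -> None:
        if failed:
            # Failure reflection: summarize now so the lesson outlives the window.
            self.failures.append(self.summarize("", step))
        self.recent.append(step)
        while len(self.recent) > self.window:
            oldest = self.recent.popleft()
            # Recursive step: the old summary is an input to the new one.
            self.summary = self.summarize(self.summary, oldest)

    def context(self) -> str:
        """Assemble what the action model would see at the next step."""
        parts = [
            f"Summary: {self.summary or '(none)'}",
            "Failures: " + ("; ".join(self.failures) or "(none)"),
            "Recent: " + " -> ".join(self.recent),
        ]
        return "\n".join(parts)


def join_summarizer(previous: str, step: str) -> str:
    """Stand-in for a model: concatenates instead of compressing."""
    return f"{previous} | {step}" if previous else step
```

With a real summarizer the `summary` string stays short no matter how long the task runs, while `failures` keeps the one kind of history that is expensive to forget.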
The paper also describes failure detection as a multi-layered process. The system does not only check whether a browser command technically executed. It verifies whether the page changed meaningfully: visible text, interactive elements, focus, URL, scroll position, modals, and other state changes. If a click technically happens but nothing changes, the system treats that as a failure. This is basic common sense, which is precisely why many agent systems forget to implement it.
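A state-diff check of this kind might look like the sketch below. The `PageState` fields mirror the signals listed above, but the type, the function names, and the example values are assumptions for illustration. It also simplifies “changed meaningfully” to “any field differs”, which a real system would want to refine.

```python
from dataclasses import dataclass, fields, replace
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class PageState:
    url: str
    visible_text: str
    interactive_elements: FrozenSet[str]
    focused_element: Optional[str]
    scroll_y: int
    modal_open: bool


def changed_signals(before: PageState, after: PageState) -> List[str]:
    """Names of the observed signals that differ between two snapshots."""
    return [
        f.name
        for f in fields(PageState)
        if getattr(before, f.name) != getattr(after, f.name)
    ]


def action_failed(command_ok: bool, before: PageState, after: PageState) -> bool:
    """A step fails if the command errored OR nothing observable changed."""
    return not command_ok or not changed_signals(before, after)


# Illustrative snapshots: a blocking modal is dismissed by a click.
before = PageState(
    url="https://example.com/recipe",
    visible_text="Sign up for our newsletter",
    interactive_elements=frozenset({"close", "subscribe"}),
    focused_element=None,
    scroll_y=0,
    modal_open=True,
)
after = replace(before, visible_text="Reviews", modal_open=False)
```

Returning the list of changed signals, rather than a bare boolean, also gives the memory layer something specific to record when a step does nothing.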
For business users, this is where the value starts to look less like “AI magic” and more like process engineering. A reliable automation system should not only act. It should know whether the action mattered.
The benchmark result is strong, but the shape of the result is more informative
The headline number is 53.7% task success on Online-Mind2Web using Avenir-Web with Gemini 3 Pro as the main action model. That is a large improvement over the open-source baselines reported in the paper: SeeAct at 30.0%, Agent-E at 27.0%, and Browser Use at 26.0%.
The result also places Avenir-Web near several proprietary systems. It trails Yutori Navigator at 64.7%, OpenAI Operator at 58.3%, and Google Computer Use at 57.3%, but it exceeds ACT-1-20250814 at 52.7% and Claude Computer Use 3.7 at 47.3% in the paper’s table. So the accurate claim is not “Avenir-Web beats all proprietary agents.” It does not. The accurate claim is more interesting: an open-source framework, when paired with strong model components, narrows much of the gap with proprietary browser agents.
There is another nuance. The main Avenir-Web configuration is an open-source framework, not a fully open-weight stack. The paper uses Gemini 3 Pro as the primary action backbone, Claude 4.5 Sonnet for Experience-Imitation Planning, and Qwen-3-VL-8B for checklist management. The authors also report a fully open-source configuration using Qwen-3-VL-8B as the main action model, which reaches 25.7%. That is close to the older open-source baselines (26.0% to 30.0%), but it is far below the 53.7% main configuration.
This distinction is not pedantic. In business evaluation, “open-source” can mean at least three different things: open code, open model weights, or reproducible deployment economics. Avenir-Web is strongest as an architectural contribution. Its primary benchmark score still depends on frontier-level proprietary model capability.
The difficulty breakdown is equally important. Avenir-Web reaches 74.1% on easy tasks, 54.6% on medium tasks, and 30.3% on hard tasks. That is progress, not completion. Hard web tasks remain hard. Excellent, the machines are still bad at bureaucracy; humanity retains one competitive advantage.
The ablations show that experience and memory carry real leverage
The ablation study is the most useful part of the paper for managers and builders because it asks: which parts actually matter?
The authors run ablations on a 50-task subset of Online-Mind2Web with Gemini 3 Flash as the backbone. The full model reaches 48.0%. Removing the Task-Tracking Checklist reduces success to 44.0%. Removing MoGE drops it to 40.0%. Replacing Adaptive Memory with a fixed sliding window drops success to 42.0%, while using an infinite context window drops it to 36.0%. Removing Experience-Imitation Planning also drops success to 36.0%.
| Test | Likely purpose | Reported result | What it supports | What it does not prove |
|---|---|---|---|---|
| Main benchmark on Online-Mind2Web | Main evidence and comparison with prior work | 53.7% overall success | Avenir-Web is a strong open-source framework configuration on live web tasks | That it is production-ready for every enterprise workflow |
| Qwen-3-VL-8B configuration | Accessibility / implementation variant | 25.7% overall success | The architecture can help compact open models reach older baseline levels | That small open models can match the main Gemini-based system |
| Remove EIP | Ablation | 48.0% → 36.0% | Procedural priors are high-leverage | That online guides are always correct or safe |
| Remove Adaptive Memory / use full context | Ablation | 48.0% → 36.0% with $W=\infty$ | More context is not automatically better | That recursive summaries are optimal in all domains |
| Remove MoGE | Ablation | 48.0% → 40.0% | Hybrid grounding improves interaction reliability | That visual-first grounding solves all UI precision problems |
| Appendix trajectories | Qualitative case analysis | Successful modal/date/dropdown-style recovery examples | Mechanisms are plausible in realistic workflows | Population-level robustness beyond the benchmark |
The most revealing result is that the two largest drops, 12 points each, come from removing EIP and from swapping Adaptive Memory for an unbounded context. That supports the paper’s central argument: web-agent reliability is not only about seeing and clicking. It is also about entering the task with procedural knowledge and preserving the right memory across steps.
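It helps to translate these percentages back into task counts, because a 50-task subset moves in two-point increments. The short script below does that arithmetic using only the reported scores; the variable names are arbitrary, and it assumes each task is scored as a plain pass or fail with all 50 tasks counted.

```python
N_TASKS = 50
FULL_SCORE = 48.0  # percent, full configuration on the ablation subset

# Reported ablation scores, in percent.
ABLATIONS = {
    "no checklist": 44.0,
    "fixed sliding window": 42.0,
    "no MoGE": 40.0,
    "infinite context": 36.0,
    "no EIP": 36.0,
}


def drop_table(full: float, ablations: dict, n_tasks: int) -> list:
    """Return (name, drop in points, implied tasks lost), largest drop first."""
    rows = []
    for name, score in ablations.items():
        drop = full - score
        tasks_lost = round(drop / 100 * n_tasks)
        rows.append((name, drop, tasks_lost))
    return sorted(rows, key=lambda row: -row[1])


for name, drop, tasks_lost in drop_table(FULL_SCORE, ABLATIONS, N_TASKS):
    print(f"{name:22s} -{drop:4.1f} pts  ({tasks_lost} tasks)")
```

Under those assumptions the largest drops correspond to six tasks and the smallest to two, so the ranking is informative but the gaps between neighboring ablations rest on one or two tasks each.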
The infinite-memory ablation is especially useful. It punctures a common assumption in agent design: “Just give the model more context.” Sometimes more context means more confusion. A browser trace contains plenty of junk—failed clicks, repeated observations, stale page text, partial states. Compression is not a compromise; it is part of intelligence.
The business lesson is not “replace workers”; it is “engineer recoverable workflows”
The most practical interpretation of Avenir-Web is not that companies can now unleash agents across every SaaS dashboard and go home early. That would be convenient, which is usually a warning sign.
The better interpretation is that web automation should be built as a recoverable workflow system:
| What the paper directly shows | Cognaptus inference for business use | What remains uncertain |
|---|---|---|
| Site-specific procedural planning improves benchmark performance | Internal SOPs, help docs, and workflow notes should become retrievable agent context | Whether internal documents are accurate, current, and safe enough to drive actions |
| Hybrid visual-semantic grounding improves robustness | Browser agents should combine screenshot-based action with DOM/API fallback | How well this works on proprietary enterprise UIs with authentication and custom widgets |
| Checklists reduce task drift | Each automation should maintain explicit completion criteria | Whether checklist generation can handle ambiguous business instructions |
| Adaptive memory improves long-horizon execution | Agents need compressed state and failure summaries, not raw logs only | Whether summaries preserve legally or operationally critical details |
| Failure detection checks state changes, not just command execution | Automation should verify that actions actually changed the system | Verification may be hard when the target state is subtle, delayed, or hidden |
This framing changes the ROI discussion. The value of agentic automation is not only labor substitution. It is cheaper diagnosis, fewer repeated failures, and better recovery when a workflow breaks. In many businesses, the expensive part is not the click; it is knowing why the click failed, whether the task is still safe to continue, and what evidence must be preserved before retrying.
Avenir-Web does not solve all of that, but it points in the right direction. It makes the agent’s internal state more inspectable. It decomposes goals. It records failures. It uses procedural priors. These are not glamorous features. They are the boring scaffolding that separates production automation from a demo video with suspiciously smooth editing.
The boundaries are not decorative; they define where deployment starts
The paper is unusually direct about real-world friction. In its limitations, the authors note privacy risks, harmful-action risks, security concerns around sensitive operations, grounding accuracy limits, latency, and computational overhead. These are not polite academic disclaimers pasted near the end because reviewers enjoy ritual. They are deployment constraints.
The anti-bot discussion is particularly important. The authors state that they deliberately avoided CAPTCHA bypass services and header-masking mechanisms. In their most successful run, approximately 10% of tasks—31 out of 300—were blocked by host infrastructure before any action could be performed. This is not a small detail. It means the live web is not merely an interface environment; it is also a defensive environment.
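A rough back-of-envelope calculation shows how much the blocking matters. It rests on two assumptions that the text above does not confirm: that the run with 31 blocked tasks is the same run that scored 53.7%, and that blocked tasks were counted as failures. Treat the output as an illustration of the arithmetic, not a reported result.

```python
TOTAL_TASKS = 300
BLOCKED_TASKS = 31
SUCCESS_RATE = 0.537  # headline rate; assumed to come from the same run

blocked_share = BLOCKED_TASKS / TOTAL_TASKS

# Implied number of successful tasks, rounded to a whole task.
successes = round(SUCCESS_RATE * TOTAL_TASKS)

# Tasks where the agent could act at all.
reachable = TOTAL_TASKS - BLOCKED_TASKS

# Success rate if blocked tasks are excluded from the denominator.
reachable_rate = successes / reachable

print(f"blocked share:          {blocked_share:.1%}")
print(f"implied successes:      {successes} of {TOTAL_TASKS}")
print(f"rate on reachable only: {reachable_rate:.1%}")
```

If both assumptions hold, the agent succeeds on roughly six in ten of the tasks it is actually allowed to attempt, which is a different operational picture from the headline number.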
For enterprise deployment, this creates a fork.
If the agent operates on public websites, anti-bot systems, CAPTCHAs, rate limits, and website terms become part of the system design. If the agent operates on internal tools, the company can create cooperative environments: whitelisted access, audit logging, role-based permissions, safe sandboxes, and APIs for high-risk operations. The second path is less dramatic, but far more sensible.
Latency is another boundary. Avenir-Web’s architecture uses multiple models and auxiliary modules. That may be acceptable for complex tasks such as research, procurement checks, or back-office form workflows. It is less suitable for sub-second interactive use. The paper’s own limitations acknowledge that large-scale multimodal models introduce computational cost and delay.
Grounding accuracy also remains imperfect. Avenir-Web improves interaction robustness, but the paper notes that hybrid grounding remains bounded by current multimodal backbones. For high-risk workflows—payments, legal filings, medical records, regulatory submissions—“pretty good at clicking” is not a control system. Human review, transaction limits, audit trails, and deterministic APIs still matter.
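One way to wire those controls around an agent is a gate that every proposed action must pass before execution. The sketch below is purely illustrative and not from the paper: the risk categories, the `ActionGate` class, the `api_available` flag, and the limit value are hypothetical placeholders for whatever a given organization's policy defines.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Decision(Enum):
    ALLOW = "allow"                # agent may act through the browser
    USE_API = "use_api"            # route to a deterministic API instead
    HUMAN_REVIEW = "human_review"  # pause until a person approves


# Hypothetical risk categories; a real policy would define its own.
HIGH_RISK = frozenset(
    {"payment", "legal_filing", "medical_record", "regulatory_submission"}
)


@dataclass(frozen=True)
class ProposedAction:
    category: str
    amount: float = 0.0
    api_available: bool = False


class ActionGate:
    def __init__(self, transaction_limit: float) -> None:
        self.transaction_limit = transaction_limit
        self.audit_log: List[Tuple[ProposedAction, Decision]] = []

    def decide(self, action: ProposedAction) -> Decision:
        if action.amount > self.transaction_limit:
            decision = Decision.HUMAN_REVIEW
        elif action.category in HIGH_RISK:
            decision = (
                Decision.USE_API if action.api_available else Decision.HUMAN_REVIEW
            )
        else:
            decision = Decision.ALLOW
        # Every decision is recorded, including the permissive ones.
        self.audit_log.append((action, decision))
        return decision
```

The gate is deliberately dumb: it does not ask the model whether an action is safe, it applies fixed rules that can be reviewed and tested independently of the agent.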
Avenir-Web’s real message: agents need operating discipline
Avenir-Web is a quiet breakthrough because it does not pretend that web automation becomes reliable when the model becomes sufficiently large and poetic. Bigger models help. Of course they do. But bigger models still need operating discipline.
The paper’s mechanism-first lesson is simple:
- Give the agent procedural experience before execution.
- Let it interact visually, but recover structurally.
- Track goals as explicit observable states.
- Compress memory and preserve failures.
- Verify that actions changed the environment.
That is not the entire future of web agents. It is closer to the minimum adult supervision layer.
For Cognaptus readers, the business takeaway is practical. When evaluating web-agent automation, do not only ask which model powers the agent. Ask what the agent knows before it starts, how it grounds actions, how it tracks progress, how it remembers failures, and how it decides whether an action actually worked.
The winning systems will not be the ones that merely click faster. They will be the ones that click, notice, recover, and remember.
Avenir-Web is not the final form of browser automation. But it is a useful correction to the current habit of treating agents as large models with a mouse. A serious web agent is not just a model. It is a workflow operator with perception, procedure, state, memory, and restraint.
That sounds less exciting than “autonomous digital employee.” Good. It is also more likely to survive contact with an actual website.
Cognaptus: Automate the Present, Incubate the Future.
[1] Aiden Yiliu Li, Xinyue Hao, Shilong Liu, and Mengdi Wang, “Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts,” arXiv:2602.02468, 2026. https://arxiv.org/abs/2602.02468