When Agents Browse Back: Why Multimodal Search Still Fails the Real Web

Search looks easy until the answer is hiding in a caption, a cropped image region, a second web page, and one annoyingly necessary intermediate clue.

That is the problem BrowseComp-V3 is trying to measure.¹ Not whether a multimodal model can recognize an object in an image. Not whether a chatbot can summarize the first search result. Not whether a web agent can click around long enough to look busy. The benchmark asks a more operationally relevant question: can an AI system browse the open web, combine text and visual evidence across multiple steps, and still arrive at the right answer?

The answer, for now, is mostly no.

The paper reports that human participants using a browser reach a 68.03% success rate and an 82.93% process score. The best tool-augmented model service in the reported table, GPT-5.2-Thinking, reaches a 39.13% success rate and a 66.05% process score. GPT-5.2 inside the authors’ transparent OmniSeeker framework reaches a 36.00% success rate and a 57.70% process score. Tool-free multimodal models perform far worse: the best reported tool-free success rate is 12.00% (Gemini-3-Flash-Preview), and several models sit in the single digits.

So the convenient story — “just give the model search tools” — is dead on arrival. It was a pleasant story. Briefly. Like most pleasant stories in AI operations, it collapses once measurement becomes less forgiving.

The more useful conclusion is sharper: multimodal browsing agents are not mainly failing because information is unavailable. BrowseComp-V3 is explicitly designed so that supporting evidence is publicly searchable. They fail because the evidence must be found, visually grounded, connected across modalities, and preserved through a long reasoning chain. Retrieval is only the entrance fee.

The evidence gap is larger than the tooling story suggests

BrowseComp-V3 contains 300 hand-crafted questions, 383 images, five primary domains, and 24 secondary domains. The tasks are designed around open-world multimodal browsing: the agent receives a question and often image input, then must use search and browsing tools to locate public evidence and produce a concise answer.

The benchmark’s central design choice is not “more questions.” It is friction. The paper deliberately constructs tasks where critical evidence is interleaved across text and image layers, sometimes within the same page and sometimes across multiple pages. A model may need to identify a visual clue, search for related information, inspect a webpage, crop an image region, disambiguate candidates, and only then answer.

This is why the benchmark is more interesting than a normal visual question-answering test. In many older multimodal benchmarks, the hard part is local perception: look at the image, extract the answer. BrowseComp-V3 instead tests whether perception can survive a browsing workflow.

The authors compare BrowseComp-V3 against prior benchmarks across dimensions such as multimodality, multi-round interaction, image-based thinking, public-search answerability, hop-based difficulty analysis, human-validated trajectories, and fine-grained progress metrics. That comparison functions as a benchmark-positioning test, not as the main performance evidence. Its purpose is to show why existing evaluations may overestimate readiness: they often measure final answers, shallow retrieval, or non-public evidence conditions, but not reproducible long-horizon multimodal browsing.

The main evidence comes later, in the model evaluation.

Setting	Representative result	What it shows	What it does not prove
Human browser baseline	68.03% success rate; 82.93% process score	The tasks are hard but solvable with public web evidence	Humans are not perfect; web search remains cognitively costly
Tool-free MLLMs	Best reported tool-free success rate is 12.00% for Gemini-3-Flash-Preview	Parametric knowledge alone is inadequate for these tasks	It does not isolate whether failure comes from missing facts, visual perception, or reasoning
Tool-augmented model services	GPT-5.2-Thinking reaches 39.13% success rate	Built-in browsing and reasoning help substantially	Tool access still leaves a large gap to the human baseline
OmniSeeker framework	GPT-5.2 reaches 36.00%; Doubao-Seed-1.8 reaches 33.67%	Transparent standardized tooling can lift many models	A better wrapper does not eliminate integration failures

The gap between success rate and process score is especially important. A model may complete several subgoals yet still fail the final answer. That is not the same as random hallucination. It is more like a junior analyst who finds three relevant documents, highlights the right chart, then somehow writes the wrong conclusion. Progress happened. Reliability did not.

BrowseComp-V3 formalizes this with process-level evaluation. For each task, expert-defined subgoals describe the necessary intermediate steps. The process score measures the proportion of those subgoals achieved:

$$ \mathrm{ProcessScore}(q) = \frac{|\hat{G}_q|}{|G_q|} $$

where $G_q$ is the set of required subgoals for question $q$, and $\hat{G}_q$ is the subset the model successfully achieves.

This is not just an academic metric. It changes how failures are interpreted. Final-answer accuracy tells you whether the system landed. Process score tells you where the aircraft started losing altitude.
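As a minimal sketch of the ratio above, the following Python computes a process score for a single task. The `TaskResult` class, its field names, and the subgoal labels are illustrative assumptions, not the paper's evaluation code. How a subgoal gets judged as achieved is left outside the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One evaluated question. All names here are hypothetical."""
    required_subgoals: frozenset   # G_q: expert-defined subgoals
    achieved_subgoals: frozenset   # subgoals the agent was judged to reach
    final_answer_correct: bool


def process_score(result: TaskResult) -> float:
    """Fraction of required subgoals achieved: |G_hat_q| / |G_q|."""
    if not result.required_subgoals:
        raise ValueError("a task needs at least one required subgoal")
    # Count only achieved subgoals that are actually required, so that
    # G_hat_q stays a subset of G_q.
    achieved = result.achieved_subgoals & result.required_subgoals
    return len(achieved) / len(result.required_subgoals)


example = TaskResult(
    required_subgoals=frozenset(
        {"identify_landmark", "find_event_page", "read_caption", "resolve_date"}
    ),
    achieved_subgoals=frozenset(
        {"identify_landmark", "find_event_page", "read_caption"}
    ),
    final_answer_correct=False,
)

print(process_score(example))        # 0.75
print(example.final_answer_correct)  # False
```

The example task reaches three of four subgoals and still misses the final answer, which is the "progress happened, reliability did not" case in numeric form.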

Public evidence does not make the web easy

A common misconception about browsing agents is that failure mostly comes from missing information. If the evidence exists online, and the model has search tools, then surely the remaining problem is just better retrieval.

BrowseComp-V3 is useful because it blocks that excuse. The benchmark requires supporting evidence to be publicly searchable. The authors also emphasize temporal stability and objective answers, reducing the risk that models fail because the web changed overnight or because the answer is subjective.

That design turns the benchmark into a cleaner diagnostic test. If an agent fails, the failure is less likely to be “the source was inaccessible” and more likely to be one of four operational weaknesses:

Failure layer	What goes wrong	Business translation
Search formulation	The agent searches the wrong phrase or fails to exploit a clue	The system cannot turn vague user intent into evidence-seeking queries
Visual grounding	The agent sees the image but attends to the wrong region	Screenshots, diagrams, product photos, maps, and scanned documents become unreliable inputs
Cross-modal integration	The agent cannot connect image evidence with textual evidence	The system finds facts but cannot assemble them into a decision
Long-horizon planning	The agent loses the thread across many tool calls	The workflow degrades as task length increases

The paper’s failure-mode analysis identifies visual grounding and perception failure as dominant error sources across models. Closed-source frontier systems reduce some perception and grounding errors, but as those improve, long-horizon planning becomes a more visible bottleneck.

That progression matters. In early multimodal AI, the question was often “can the model see?” In enterprise browsing agents, the question becomes “can the model keep track of what it saw, why it searched, what it ruled out, and what still needs to be verified?”

This is less glamorous than a demo. It is also where production systems usually break.

OmniSeeker helps because structure helps, not because wrappers are magic

The paper also introduces OmniSeeker, a unified multimodal browsing agent framework. It equips models with a standardized set of tools: text search, webpage visit, image search, image crop, and reverse image search. The framework limits interactions to 20 rounds per question and uses a retrieval setup where search returns top results, webpages are parsed, and images can be embedded or cropped for model inspection.
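To make the shape of such a framework concrete, here is a rough sketch of a round-limited tool loop. It is not OmniSeeker's implementation: the five tool names and the 20-round cap come from the description above, while the stub tool bodies, the `policy` callable, the trace format, and the scripted example are assumptions made for illustration.

```python
from typing import Callable, Optional

MAX_ROUNDS = 20  # per-question interaction limit described above

# Tool names follow the article; the bodies are placeholder stubs.
TOOLS: dict[str, Callable[[str], str]] = {
    "text_search": lambda arg: f"[results for {arg!r}]",
    "visit_webpage": lambda arg: f"[parsed text of {arg}]",
    "image_search": lambda arg: f"[images for {arg!r}]",
    "image_crop": lambda arg: f"[cropped region {arg}]",
    "reverse_image_search": lambda arg: f"[pages containing {arg}]",
}


def run_episode(
    question: str,
    policy: Callable[[str, list], tuple],
    max_rounds: int = MAX_ROUNDS,
) -> tuple[Optional[str], list]:
    """Run one question; return (answer or None, inspectable trace)."""
    trace: list = []
    for round_idx in range(1, max_rounds + 1):
        # The policy stands in for the model: it sees the question and
        # the trace so far, and picks a tool call or a final answer.
        name, arg = policy(question, trace)
        if name == "answer":
            return arg, trace
        tool = TOOLS.get(name)
        observation = tool(arg) if tool else f"unknown tool: {name}"
        trace.append(
            {"round": round_idx, "tool": name, "arg": arg,
             "observation": observation}
        )
    return None, trace  # round budget exhausted without an answer


def scripted_policy(question: str, trace: list) -> tuple:
    """Hypothetical fixed script, standing in for a real model."""
    script = [
        ("image_crop", "upper-left"),
        ("reverse_image_search", "crop-1"),
        ("visit_webpage", "https://example.com/page"),
        ("answer", "placeholder answer"),
    ]
    return script[len(trace)]


answer, trace = run_episode("Which event is shown?", scripted_policy)
print(answer, len(trace))  # placeholder answer 3
```

The design point is that the loop, not the model, owns the budget and the trace. Every tool call is recorded whether or not the episode ends in an answer.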

OmniSeeker’s contribution is practical: it makes the browsing process more transparent and comparable across models. In the reported results, OmniSeeker substantially improves many models over their tool-free versions. GPT-5.2 rises from 6.00% success rate tool-free to 36.00% with OmniSeeker. Doubao-Seed-1.8 rises from 9.00% tool-free to 33.67% with OmniSeeker. Claude-Sonnet-4.5 rises from 4.00% to 22.67%.
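The arithmetic behind those figures is simple enough to check directly. The sketch below uses only the success rates quoted in this article; the dictionary layout and the `gain_and_gap` helper are illustrative names, not anything from the paper.

```python
HUMAN_SR = 68.03  # human browser baseline success rate (%), as quoted above

# (tool-free success rate %, OmniSeeker success rate %), as quoted above
REPORTED = {
    "GPT-5.2": (6.00, 36.00),
    "Doubao-Seed-1.8": (9.00, 33.67),
    "Claude-Sonnet-4.5": (4.00, 22.67),
}


def gain_and_gap(tool_free: float, with_tools: float,
                 human: float = HUMAN_SR) -> tuple[float, float]:
    """Return (points gained from tools, points still below humans)."""
    return round(with_tools - tool_free, 2), round(human - with_tools, 2)


for model, (tool_free, omni) in REPORTED.items():
    gain, gap = gain_and_gap(tool_free, omni)
    print(f"{model}: +{gain:.2f} points from tools, "
          f"{gap:.2f} points below the human baseline")
# GPT-5.2: +30.00 points from tools, 32.03 points below the human baseline
# Doubao-Seed-1.8: +24.67 points from tools, 34.36 points below the human baseline
# Claude-Sonnet-4.5: +18.67 points from tools, 45.36 points below the human baseline
```

For two of the three models, the gap that remains after adding tools is larger than the gain the tools delivered.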

That is a major improvement. It is also not a cure.

The correct business reading is not “build an OmniSeeker-like wrapper and the problem is solved.” The better reading is: structured tool orchestration is a necessary reliability layer, but it does not replace multimodal reasoning.

A transparent browsing framework helps in three ways.

First, it standardizes what the model can do. Without that, evaluation mixes model capability, product interface quality, tool availability, and hidden vendor orchestration into one blurry number.

Second, it makes traces inspectable. If the agent fails, you can examine whether it searched poorly, visited the wrong page, cropped the wrong image region, or lost track of a subgoal.

Third, it creates a training and improvement surface. Subgoal traces can support reinforcement learning, supervised fine-tuning, or workflow-specific correction rules. A black-box answer gives you a grade. A trace gives you a repair manual.
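One way to picture the "repair manual" is a trace annotated with subgoal checks, where the first unmet subgoal points at a failure layer. This is a hedged sketch: the `SubgoalCheck` structure, the layer labels, and the example steps are assumptions for illustration, and the paper's own trace and subgoal formats may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubgoalCheck:
    """One expert-defined subgoal, tagged with a failure layer (hypothetical)."""
    name: str
    layer: str      # "search", "grounding", "integration", or "planning"
    achieved: bool


def first_break(checks: list) -> Optional[SubgoalCheck]:
    """Return the earliest unmet subgoal in task order, or None if all pass."""
    for check in checks:
        if not check.achieved:
            return check
    return None


# Hypothetical annotated trace for one failed task.
checks = [
    SubgoalCheck("query mentions the visual clue", "search", True),
    SubgoalCheck("crop covers the signboard", "grounding", True),
    SubgoalCheck("caption linked to cropped region", "integration", False),
    SubgoalCheck("candidate list narrowed to one", "planning", False),
]

broken = first_break(checks)
if broken is not None:
    print(f"first failure: {broken.name} (layer: {broken.layer})")
# first failure: caption linked to cropped region (layer: integration)
```

A final-answer grade would mark this task wrong and stop there. The annotated trace says the search and the crop were fine and the break came at integration.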

But even under OmniSeeker, the best reported success rate, GPT-5.2 at 36.00%, sits roughly 32 points below the 68.03% human baseline. That is not a small rounding error away from human reliability. It is a capability gap large enough to matter in procurement, workflow design, and compliance review.

The fine-grained tests diagnose bottlenecks rather than adding a second thesis

The paper’s further analysis should not be read as a separate argument. It is mainly diagnostic: the authors are asking where capability degrades after the headline results establish that it does degrade.

The task-level analysis separates problems by cross-modal complexity. Level 1 tasks involve more unitary visual search. Level 2 and Level 3 tasks require stronger inter-region integration and inter-image relational reasoning. In Table 3, process scores generally decline from Level 1 to Level 2 and remain low at Level 3. For example, Claude-Sonnet-4.5 falls from 0.5708 at Level 1 to 0.5353 at Level 2 and 0.5186 at Level 3. Qwen3-VL-235B falls from 0.3262 at Level 1 to 0.2308 at Level 2, then partly recovers to 0.2715 at Level 3, still below its Level 1 score.

This test supports the mechanism: difficulty is not only about searching longer. It is about integrating visual evidence across increasingly complex relationships.

The search-depth analysis plays a different role. It examines how performance changes as browsing paths become longer. The paper reports that both humans and models decline with increasing search depth, but in different patterns. Human performance drops sharply on longer paths, likely because attention and cognitive load become limiting. Model performance declines more gradually, which the authors interpret as possible compensation through internal parametric knowledge.

That is an interesting interpretation, but it should be treated carefully. It suggests models may sometimes appear less sensitive to search length because they lean on memorized or inferred knowledge. In production, that is not automatically good. A system that fills gaps with plausible prior knowledge can look robust until the task requires exact evidence.

The ability-boundary analysis is more directly operational. Humans are limited more by text search burden; models are limited more by multimodal integration. In other words, people get tired reading too much. Models get confused connecting what they read with what they saw. Neither is noble. One is at least easier to manage with better interfaces.

The test-time scaling analysis is best read as an exploratory sensitivity test. Increasing the number of interaction turns improves performance, especially for larger models. Drawing more independent samples also helps, and in the reported Qwen3-VL-235B analysis Best-of-N scales more effectively than the other aggregation strategies. This supports a familiar but often ignored point: more compute helps only when the system can use it productively. Giving a weak agent more turns may simply produce a longer mistake.
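To see why the aggregation rule matters when sampling several independent runs, the sketch below contrasts majority voting with a Best-of-N selection. The article does not describe how Best-of-N picks its winner, so the `score` function here (imagine a verifier or evidence check) is an assumption, as are the sample answers and scores.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Pick the most frequent answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], score: Callable[[str], float]) -> str:
    """Pick the sample that a scoring function rates highest."""
    return max(answers, key=score)


# Five hypothetical independent runs on the same question.
samples = ["Lyon", "Lyon", "Marseille", "Lyon", "Nice"]

# Hypothetical evidence scores, e.g. from checking each answer against
# the pages and image crops its run actually retrieved.
evidence_score = {"Lyon": 0.2, "Marseille": 0.9, "Nice": 0.1}

print(majority_vote(samples))                  # Lyon
print(best_of_n(samples, evidence_score.get))  # Marseille
```

Majority voting rewards whatever the model says most often, including a repeated mistake. Selection by score can recover a minority answer, but only as far as the scorer itself is reliable.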

Finally, the failure-mode analysis functions as the paper’s mechanism check. It connects the quantitative gap to concrete failure categories: grounding, perception, and planning. That is the section enterprise readers should care about most, because failure categories map more directly to engineering controls than benchmark scores do.

What businesses should actually take from this

BrowseComp-V3 does not say that browsing agents are useless. It says they are diagnosable. That is much more valuable.

For enterprise AI, the paper supports a shift from answer-only evaluation to process-aware evaluation. A procurement test that asks an AI agent 50 questions and records only final accuracy is under-instrumented. It may tell you which vendor looks good on a leaderboard-like task. It will not tell you whether the system failed because of poor search, weak visual grounding, bad page parsing, inadequate memory, or final-step reasoning collapse.

A better enterprise evaluation should separate at least four layers:

Evaluation layer	Practical question	Example evidence
Evidence accessibility	Could the required evidence be found through approved sources?	Search logs, source whitelist coverage, retrieval recall
Subgoal progress	Did the agent complete necessary intermediate steps?	Expert or automated subgoal scoring
Cross-modal grounding	Did the agent correctly map text to images, screenshots, charts, or regions?	Visual grounding checks, crop traces, OCR comparisons
Final integration	Did the system combine evidence into the right conclusion?	Answer accuracy, explanation audit, contradiction checks
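A layered evaluation of this kind can be recorded per task and then aggregated. The following is a small sketch under stated assumptions: the four layer keys mirror the table above, while the record format and the example outcomes are invented for illustration and are not benchmark data.

```python
LAYERS = (
    "evidence_accessibility",
    "subgoal_progress",
    "cross_modal_grounding",
    "final_integration",
)


def layer_pass_rates(records: list) -> dict:
    """Share of tasks passing each layer; each record maps layer -> bool."""
    if not records:
        raise ValueError("need at least one task record")
    return {
        layer: sum(bool(record[layer]) for record in records) / len(records)
        for layer in LAYERS
    }


# Four hypothetical task records, not results from any benchmark.
records = [
    dict(zip(LAYERS, (True, True, True, True))),
    dict(zip(LAYERS, (True, True, True, False))),
    dict(zip(LAYERS, (True, True, False, False))),
    dict(zip(LAYERS, (True, False, False, False))),
]

for layer, rate in layer_pass_rates(records).items():
    print(f"{layer}: {rate:.2f}")
# evidence_accessibility: 1.00
# subgoal_progress: 0.75
# cross_modal_grounding: 0.50
# final_integration: 0.25
```

In this made-up example, answer-only scoring would report 25% and nothing else. The layered view shows that evidence was always reachable and that losses accumulate at grounding and integration.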

This distinction matters in different industries for different reasons.

In financial research, a browsing agent may collect filings, market commentary, and chart images. The risk is not simply that it misses a source. The risk is that it finds the right source but connects the wrong figure to the wrong company, period, or metric.

In legal and compliance workflows, the agent may need to connect a regulation, a clause, and an external document. A final answer without traceable subgoals is hard to audit. “The model said so” remains a poor compliance architecture. Astonishingly, regulators have not yet accepted vibes as a control framework.

In procurement, the benchmark suggests that buyers should test agent systems on realistic multi-step tasks rather than polished demos. The key question is not “can the model browse?” The key question is “can it preserve evidence quality across a chain of search, perception, disambiguation, and reasoning?”

In internal operations, BrowseComp-V3 also suggests a more realistic workflow design. Human-in-the-loop review should not wait until the final answer. Review should be inserted at high-risk subgoals: entity identification, image-region interpretation, source selection, and final synthesis. This reduces review cost because humans inspect the fragile joints rather than the entire browsing transcript.
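As a rough illustration of that routing idea, the sketch below filters an agent transcript down to the steps a human should inspect. The four high-risk step kinds come from the paragraph above; the step dictionary format, the `kind` labels on other steps, and the example transcript are assumptions.

```python
# High-risk step kinds named in the text above.
HIGH_RISK_KINDS = frozenset({
    "entity_identification",
    "image_region_interpretation",
    "source_selection",
    "final_synthesis",
})


def steps_needing_review(steps: list) -> list:
    """Keep only the steps whose kind is on the high-risk list."""
    return [step for step in steps if step["kind"] in HIGH_RISK_KINDS]


# Hypothetical six-step transcript for one task.
transcript = [
    {"id": 1, "kind": "query_formulation"},
    {"id": 2, "kind": "entity_identification"},
    {"id": 3, "kind": "page_fetch"},
    {"id": 4, "kind": "image_region_interpretation"},
    {"id": 5, "kind": "page_fetch"},
    {"id": 6, "kind": "final_synthesis"},
]

flagged = steps_needing_review(transcript)
print([step["id"] for step in flagged])  # [2, 4, 6]
```

Here a reviewer looks at three steps instead of six. In practice the step kinds would need to be assigned by the agent framework or by a classifier, which is its own source of error.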

The boundary: this benchmark diagnoses capability, not ROI

The paper’s limitations are straightforward and important.

BrowseComp-V3 has 300 carefully curated tasks. That is large enough to expose meaningful failure patterns, but it is not a direct simulation of every enterprise domain. A company evaluating agents for insurance claims, supply-chain monitoring, investment research, or medical operations would still need domain-specific tests.

The benchmark also measures task success and process progress, not production economics. It does not directly estimate latency, tool cost, vendor lock-in, data governance burden, user trust, or the operational cost of human review. Those are business variables, not benchmark variables.

The public-searchability requirement is a strength for fairness and reproducibility, but it also means the benchmark does not fully represent private enterprise environments where evidence may sit inside internal databases, PDFs, emails, dashboards, and permissioned systems. The same failure modes may appear there, but the access layer becomes more complicated.

Finally, test-time scaling results should not be casually converted into a budget recommendation. More interaction turns and Best-of-N sampling can improve outcomes, but they also increase cost and latency. In production, the question is not whether more compute helps. The question is where extra compute buys enough reliability to justify itself.
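One simple way to frame that trade-off is cost per correct answer. The sketch below is an assumption-laden illustration, not a finding from the paper: the formula ignores latency and review cost, and every number in the example is made up.

```python
def cost_per_correct(cost_per_run: float, n_samples: int,
                     success_rate: float) -> float:
    """Expected spend per correct answer when each question uses n_samples runs."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_run * n_samples / success_rate


# Hypothetical configurations; the costs and rates are invented inputs.
single = cost_per_correct(cost_per_run=0.40, n_samples=1, success_rate=0.30)
best_of_4 = cost_per_correct(cost_per_run=0.40, n_samples=4, success_rate=0.40)

print(f"single run: {single:.2f} per correct answer")
print(f"best-of-4:  {best_of_4:.2f} per correct answer")
```

With these invented inputs, the four-sample configuration is more accurate and also about three times more expensive per correct answer. Whether that is worth paying depends on what a wrong answer costs in the workflow.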

The real lesson is not that agents cannot browse

BrowseComp-V3 is useful because it draws a boundary between web access and web competence.

A model with tools can search. It can open pages. It can inspect images. It can crop regions. It can generate a confident final answer. None of that guarantees that it has built a stable evidence chain.

The paper’s evidence-first message is therefore uncomfortable but productive: multimodal browsing agents are already capable enough to make partial progress, yet unreliable enough that final answers remain fragile. They are not useless assistants. They are not autonomous analysts either. They occupy the messy middle, where operational design matters.

For businesses, that middle is where the work begins. The near-term opportunity is not to pretend agents can replace evidence-heavy workflows end-to-end. The opportunity is to build systems that make agent reasoning inspectable, score progress before final answers, and place human review exactly where multimodal integration is most likely to break.

The web is not just a pile of documents. It is a messy network of text, images, layouts, captions, screenshots, metadata, and context. BrowseComp-V3 shows that today’s agents can move through that network, but they do not yet understand it as a system of interdependent signals.

They browse. They sometimes browse impressively.

They still need supervision when the web browses back.

Cognaptus: Automate the Present, Incubate the Future.

¹ Huanyao Zhang et al., “BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents,” arXiv:2602.12876, submitted February 13, 2026 and revised February 24, 2026. https://arxiv.org/abs/2602.12876
