An analyst opens a promising webpage. It contains the answer somewhere between a navigation menu, several years of archived material, an interactive table, related articles, legal disclaimers, and enough decorative HTML to keep a language model occupied until lunch.

A human scans, clicks, ignores, and moves on.

A browser agent is more likely to ingest the entire page, append it to an already swollen context window, and then congratulate itself for having “conducted research.”

This is the practical problem addressed by “Nested Browser-Use Learning for Agentic Information Seeking,” a paper from Tongyi Lab and Alibaba Group.[1] Its central contribution is not another method for giving agents more access to the web. It is a method for preventing that access from overwhelming the agent trying to use it.

The paper introduces NestBrowse, a browser-use framework built around three connected design choices:

  1. Reduce browser interaction to four actions: search, visit, click, and fill.
  2. Separate task-level reasoning from page-level reading through nested outer and inner loops.
  3. Train both behaviours jointly, so the same model learns when to navigate and what to retain.

The headline benchmark results are strong. But the more useful lesson is architectural: an information-seeking agent should not treat everything it reads as part of its working memory.

The browser needs access to the page. The reasoning model usually does not.

The Real Bottleneck Is What Enters the Agent’s Memory

Most web-research agents begin with an appealingly simple toolkit:

  • search for finding candidate pages;
  • visit for retrieving their contents.

This works when the required information is visible in search snippets or available through a single static page request. It becomes incomplete when the answer sits behind client-side rendering, a form, an expandable section, an interactive calculator, or several page transitions.

Providing a real browser appears to solve the access problem. It also creates two new problems.

The first is action complexity. A conventional browser exposes many possible operations: scroll, hover, select, type, move, open, close, search within the page, switch tabs, and so forth. Each additional action expands the agent’s decision space. More control does not automatically produce better control.

The second is context pollution. A page can contain tens of thousands of tokens, and the paper notes that some can exceed one million. When raw page contents are passed into a ReAct-style agent at every step, the task-level reasoning context grows with each page the agent opens.

Longer context windows postpone the failure. They do not remove the underlying design flaw.

An agent performing a long investigation does not need a complete textual record of every page it has opened. It needs the evidence relevant to the question, enough surrounding context to interpret that evidence, and sufficient page state to continue interacting when necessary.

NestBrowse is designed around that distinction.

Four Actions Cover Access Without Recreating the Browser

NestBrowse uses a text-only headless browser implemented with Playwright. Pages are converted from raw HTML into semantic DOM snapshots containing readable content and identifiers for interactive elements.

The agent then receives only four browser actions:

| Action | Operational purpose | Information pathway |
| --- | --- | --- |
| search | Find candidate pages through batched queries | Static discovery |
| visit | Open a URL and extract information relevant to a stated goal | Static or newly rendered page content |
| click | Trigger a page transition or reveal dynamic content | Dynamic information |
| fill | Enter text into forms or editable elements | Dynamic interaction |

The important word is not minimal. It is complete.

search and visit cover the conventional information-retrieval pathway. click and fill add access to content and functionality that require interaction. Together, the four actions allow the agent to reach both static information and dynamic browser states without exposing every low-level browser gesture as a separate reasoning choice.
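A minimal sketch of what such a four-action interface could look like as a data structure. Only the four action names and the rule that visit and click carry a goal come from the paper as described here; the class, the field names, and the validation messages are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# The four action names come from the paper; every field name below is a
# hypothetical choice made for this sketch.
ACTIONS = {"search", "visit", "click", "fill"}
GOAL_REQUIRED = {"visit", "click"}  # actions that can open a new page


@dataclass(frozen=True)
class BrowserAction:
    name: str
    queries: tuple = ()                # search: batched queries
    url: Optional[str] = None          # visit: page to open
    element_id: Optional[str] = None   # click / fill: target element
    text: Optional[str] = None         # fill: text to enter
    goal: Optional[str] = None         # visit / click: what to extract


def validate(action: BrowserAction) -> list:
    """Return a list of problems; an empty list means the call is well formed."""
    if action.name not in ACTIONS:
        return [f"unknown action: {action.name}"]
    problems = []
    if action.name == "search" and not action.queries:
        problems.append("search needs at least one query")
    if action.name == "visit" and not action.url:
        problems.append("visit needs a url")
    if action.name in {"click", "fill"} and not action.element_id:
        problems.append(f"{action.name} needs an element_id")
    if action.name == "fill" and action.text is None:
        problems.append("fill needs text")
    if action.name in GOAL_REQUIRED and not action.goal:
        problems.append(f"{action.name} needs a goal")
    return problems
```

Under this sketch a call such as `BrowserAction("scroll")` is rejected outright, which is the point: low-level gestures never become options the reasoning model has to weigh.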

Scrolling and in-page search are deliberately absent.

That decision initially sounds strange. Scrolling is practically synonymous with browsing. But within the NestBrowse framework, scrolling only controls how much raw content becomes visible at once. It does not determine which content is relevant to the task.

The paper replaces exposure management with goal-conditioned extraction. Rather than asking the outer agent to scroll through a page and repeatedly decide what to inspect, an inner process reads the page in segments and returns only useful material.

The agent does not become better at scrolling. It becomes less dependent on scrolling as a reasoning strategy.

The Outer Loop Decides Where to Go; the Inner Loop Decides What Matters

NestBrowse divides browser use into two nested levels.

User question
Outer loop: task-level reasoning
Choose search, visit, click, or fill
If a new page is opened:
    Inner loop: inspect page segments against the current goal
    Maintain a temporary evidence workspace
    Return compact, goal-relevant information
Outer loop continues reasoning

The outer loop behaves like a conventional tool-using agent. It reasons about the overall task, selects tools, evaluates returned evidence, and decides what to do next.

The inner loop is activated when visit or click moves the browser to a new page. It receives two things:

  • the page’s raw textual content;
  • the specific goal supplied by the outer loop.

The page is divided into segments. The inner loop processes them incrementally, adding relevant material to a temporary workspace and discarding irrelevant material. Once the page has been explored, the completed workspace is returned to the outer loop.

This is more precise than simply summarising the page.

A generic summary tries to represent the page. The NestBrowse inner loop tries to support a specific investigation. The same page visited under two different goals could therefore produce two different outputs.

For example, a corporate filing might be visited with the goal of finding reported revenue, identifying litigation exposure, or verifying the appointment date of an executive. A general summary would spend scarce tokens covering all three. Goal-conditioned extraction concentrates on the evidence required for the current step.
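The filing example can be sketched in a few lines. This is a toy under loud assumptions: the page text is invented, segments are simply blank-line-separated paragraphs, and `is_relevant` is a keyword match standing in for the trained model's judgement. None of it is the paper's implementation.

```python
def is_relevant(segment: str, goal_terms: set) -> bool:
    """Placeholder for the model's judgement: plain keyword overlap."""
    words = {w.strip(".,:;()").lower() for w in segment.split()}
    return bool(words & goal_terms)


def explore_page(page_text: str, goal_terms: set) -> list:
    """Inner loop: read segments one at a time and keep only relevant ones."""
    workspace = []  # temporary; exists only while this page is being read
    for segment in page_text.split("\n\n"):
        segment = segment.strip()
        if segment and is_relevant(segment, goal_terms):
            workspace.append(segment)
    return workspace  # the only thing handed back to the outer loop


# Invented page content, used purely for illustration.
FILING = """Revenue for the fiscal year was 4.2 billion dollars.

The company is a defendant in two pending patent lawsuits.

Jane Doe was appointed chief financial officer on 3 March.

This document contains forward-looking statements."""

revenue_view = explore_page(FILING, {"revenue"})
litigation_view = explore_page(FILING, {"lawsuits", "litigation"})
```

The same page yields one sentence about revenue under the first goal and one sentence about lawsuits under the second. In neither case does the disclaimer paragraph travel any further.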

The outer loop therefore receives a filtered information product rather than a page dump.

That design produces a useful separation:

| Layer | Primary question | Working state |
| --- | --- | --- |
| Outer loop | “What must I investigate next?” | Task hypotheses, accumulated evidence, next actions |
| Inner loop | “What on this page helps with the current goal?” | Page segments, relevant passages, temporary summary |
| Browser backend | “What content and interactions are available?” | Semantic DOM snapshot and interactive-element identifiers |

The browser preserves access. The inner loop controls selection. The outer loop preserves reasoning capacity.

Filtering Creates a Trade-Off, Not Free Intelligence

Goal-conditioned extraction solves one problem by creating another.

Once irrelevant page content is discarded, the outer loop cannot use it later unless the page is revisited under a new goal. The system therefore depends on the quality of the goal supplied to visit or click.

A vague goal can cause the inner loop to return vaguely useful material. A prematurely narrow goal can exclude evidence that would have changed the investigation.

This is the unavoidable trade-off behind the architecture:

  • passing everything forward preserves optionality but exhausts context;
  • filtering aggressively preserves context but risks omission.

NestBrowse does not eliminate that trade-off. It trains a model to manage it.

The implementation prompts reported in the appendix reveal what the inner loop is expected to produce. For each page segment, it maintains structured fields for the rationale, extracted evidence, and a concise summary. When processing later segments, it receives previously accumulated evidence and updates the workspace incrementally.

This matters because filtering is not performed as a one-shot compression step after the entire page has entered the main context. It happens within a separate temporary workspace before the result reaches task-level reasoning.

The architecture controls information at the boundary rather than cleaning up the mess afterward.
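One way to picture that incremental workspace is as a small record updated once per segment. The three field names (rationale, evidence, summary) follow the description above; the class, the `update` function, and the substring-matching rule are assumptions made for this sketch, with the matching rule standing in for the model.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    goal: str
    rationale: str = ""                           # why the latest segment mattered, or did not
    evidence: list = field(default_factory=list)  # passages kept so far
    summary: str = ""                             # compact digest of the evidence


def update(workspace: Workspace, segment: str, keywords: set) -> Workspace:
    """Process one segment, given what earlier segments already contributed."""
    sentences = [s.strip() for s in segment.split(".") if s.strip()]
    hits = [
        s for s in sentences
        if any(k in s.lower() for k in keywords) and s not in workspace.evidence
    ]
    if hits:
        workspace.evidence.extend(hits)
        workspace.rationale = f"segment added {len(hits)} new passage(s)"
    else:
        workspace.rationale = "segment added nothing new"
    workspace.summary = "; ".join(workspace.evidence)
    return workspace


ws = Workspace(goal="effective date of the regulation")
segments = [
    "The regulation was published in May. It takes effect on 1 July",
    "Contact the press office for enquiries",
    "It takes effect on 1 July. Penalties apply after the effective date",
]
for seg in segments:
    update(ws, seg, {"effect"})
```

Because each call sees the evidence already collected, the repeated sentence in the third segment is not stored twice. Only `ws.summary` or `ws.evidence` would be handed to task-level reasoning; the segments themselves never are.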

One Model Learns Two Different Browsing Jobs

NestBrowse trains the outer and inner loops jointly through multi-task imitation learning.

The outer-loop objective teaches the model to reproduce successful investigation trajectories: reasoning, selecting tools, passing arguments, processing responses, and eventually producing the answer.

The inner-loop objective teaches the model to extract goal-relevant evidence from page segments while maintaining its temporary workspace.

The combined training objective gives equal default weighting to the two tasks. The aim is to teach a single model both how to conduct an investigation and how to read individual pages in support of that investigation.
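In symbols, the combined objective can be written as a weighted sum of the two imitation losses. The notation below is this article's shorthand rather than the paper's own: the subscripts and the weights are assumed names, and the only property carried over is that the two weights are equal by default.

```latex
\mathcal{L}(\theta)
  = \lambda_{\text{out}}\,\mathcal{L}_{\text{out}}(\theta)
  + \lambda_{\text{in}}\,\mathcal{L}_{\text{in}}(\theta),
\qquad
\lambda_{\text{out}} = \lambda_{\text{in}} \ \text{by default}
```

Here the first term is the loss on outer-loop trajectories, the second is the loss on inner-loop extraction, and both are computed for the same parameters, which is what makes the training joint.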

Training trajectories are filtered through rejection sampling. The paper discards trajectories containing:

  • formatting violations;
  • hallucinated or invalid tool calls;
  • incorrect final answers.

It does not impose detailed rejection rules on intermediate reasoning or browsing behaviour. The authors argue that retaining variation in successful trajectories avoids teaching an overly brittle, manually prescribed browsing style.

This is a practical choice. Agent trajectories can reach correct answers through different sequences of queries, visits, and clicks. Filtering only for executable behaviour and successful outcomes preserves that diversity.

It also limits what the training signal can guarantee. A correct final answer does not prove that every intermediate action was efficient, well-supported, or safe. The training process rewards successful browser use, not necessarily the operational qualities an enterprise deployment would require.
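The three rejection rules can be sketched as a simple filter. The `Trajectory` fields are hypothetical, the `well_formatted` flag is assumed to come from some upstream checker, and exact string comparison is a stand-in for however answer correctness is actually judged.

```python
from dataclasses import dataclass

VALID_TOOLS = {"search", "visit", "click", "fill"}


@dataclass
class Trajectory:
    tool_calls: list        # tool names in the order they were invoked
    well_formatted: bool    # hypothetical flag set by an upstream format checker
    final_answer: str
    reference_answer: str


def rejection_reason(t: Trajectory):
    """Return why a trajectory is rejected, or None if it is kept."""
    if not t.well_formatted:
        return "formatting violation"
    if any(name not in VALID_TOOLS for name in t.tool_calls):
        return "invalid tool call"
    # Simplification: normalised string equality instead of a real answer judge.
    if t.final_answer.strip().lower() != t.reference_answer.strip().lower():
        return "incorrect final answer"
    return None  # no rule inspects intermediate reasoning or path length


def keep(trajectories: list) -> list:
    """Keep every trajectory that passes all three checks."""
    return [t for t in trajectories if rejection_reason(t) is None]
```

Nothing in the filter looks at how many steps a trajectory took or which pages it opened. That is how path diversity survives, and also why efficiency and safety are left unexamined.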

The Main Results Show Architecture Moving the Performance Frontier

The researchers train two models:

  • NestBrowse-4B, based on Qwen3-4B-Thinking-2507;
  • NestBrowse-30B-A3B, based on Qwen3-30B-A3B-Thinking-2507.

Both operate with a maximum context length of 128,000 tokens and a limit of 100 tool invocations. Training required approximately 1,344 H20 GPU-hours for the 4B model and 4,096 H20 GPU-hours for the 30B-A3B model.

The models are evaluated on four difficult web-based question-answering benchmarks: BrowseComp, BrowseComp-zh, the 103-question text-only subset of GAIA, and XBench-DeepSearch. Final-answer accuracy is judged using GPT-4.1 under each benchmark’s official evaluation prompt.[2]

Selected results show where NestBrowse is strong and where larger proprietary systems remain ahead:

| Model | Web toolkit | BrowseComp | BrowseComp-zh | GAIA | XBench 2505 / 2510 |
| --- | --- | --- | --- | --- | --- |
| WebSailor-V2-30B-A3B-SFT | Search, visit | 24.4 | 28.3 | 66.0 | 61.7 / — |
| WebLeaper-30B-A3B-RU | Search, visit | 23.0 |  | 67.0 | 66.0 / — |
| NestBrowse-4B | Text browser | 22.4 | 28.4 | 68.9 | 74.0 / 38.0 |
| NestBrowse-30B-A3B | Text browser | 31.6 | 42.6 | 75.7 | 75.0 / 45.0 |
| OpenAI-o3 | Browser | 49.7 | 58.1 | 70.5 | 66.7 / — |

NestBrowse-30B-A3B leads the open-source systems listed in the paper across the reported benchmarks. The 4B version also exceeds many substantially larger open-source agents, particularly on GAIA and XBench.

The result does not justify the convenient slogan that architecture simply “beats scale.” OpenAI-o3 and OpenAI DeepResearch remain considerably stronger on BrowseComp, while o3 also leads on BrowseComp-zh.

The more defensible conclusion is that interaction architecture can substantially improve what a given model scale can accomplish.

NestBrowse changes the performance frontier. It does not abolish it.

The results on Chinese benchmarks are also notable because the models were trained only on English data. That suggests some transfer of browsing and information-seeking behaviour across languages. It does not establish broad multilingual reliability, but it indicates that the learned interaction pattern is not entirely tied to the training language.

The Ablation Shows That Reading Discipline Matters More Than Tool Minimalism Alone

The paper’s most informative experiment is not the main leaderboard. It is the browser-strategy ablation.

To isolate the effects of the framework’s design choices, the researchers run four browser-use settings with the same GPT-OSS-120B agent model:

| Setting | Simplified toolkit | Goal-relevant extraction | GAIA | XBench |
| --- | --- | --- | --- | --- |
| Naive | No | No | 46.6 | 40.0 |
| Simplified | Yes | No | 55.3 | 40.0 |
| Compressed | No | Yes | 60.2 | 61.0 |
| NestBrowse | Yes | Yes | 73.8 | 71.0 |

This is an ablation study: its purpose is to identify which system components account for the improvement.[3]

Simplifying the action space helps GAIA, raising accuracy from 46.6 to 55.3, but makes no difference on XBench in this experiment.

Goal-relevant extraction produces a larger independent improvement: GAIA rises to 60.2, while XBench rises from 40.0 to 61.0.

Combining both components yields the best result by a substantial margin: 73.8 on GAIA and 71.0 on XBench.
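The arithmetic behind those three statements is easy to check. The sketch below uses only the figures in the ablation table; the function and key names are this article's own, and the "beyond additive" remainder is a descriptive calculation, not a statistic reported by the authors.

```python
# Accuracy figures copied from the ablation table above.
ABLATION = {
    "GAIA":   {"naive": 46.6, "simplified": 55.3, "compressed": 60.2, "nestbrowse": 73.8},
    "XBench": {"naive": 40.0, "simplified": 40.0, "compressed": 61.0, "nestbrowse": 71.0},
}


def decompose(scores: dict) -> dict:
    """Split the total gain into the two single-component gains and a remainder."""
    toolkit = scores["simplified"] - scores["naive"]
    extraction = scores["compressed"] - scores["naive"]
    total = scores["nestbrowse"] - scores["naive"]
    return {
        "toolkit_only": round(toolkit, 1),
        "extraction_only": round(extraction, 1),
        "combined": round(total, 1),
        "beyond_additive": round(total - toolkit - extraction, 1),
    }


for benchmark, scores in ABLATION.items():
    print(benchmark, decompose(scores))
```

On GAIA the two single-component gains are 8.7 and 13.6 points, while the combined gain is 27.2, leaving 4.9 points that neither component explains alone. On XBench the corresponding figures are 0.0, 21.0, 31.0, and 10.0. These are single reported runs, so the remainder should be read as suggestive of interaction between the two components, not as a measured effect.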

The implication is more specific than “simple tools are good.”

A smaller toolkit reduces decision burden. But the larger problem is the amount of irrelevant information returned after a tool has been selected. Reducing the action space helps the agent choose. Filtering page content helps it continue thinking after the choice.

The combined design improves both sides of the interaction boundary:

  • fewer ways to ask the browser to act;
  • less irrelevant material returned after it acts.

For enterprise system designers, that distinction is important. Redesigning tool schemas without redesigning tool responses addresses only half the problem.

The Context Analysis Shows Why the Nested Loop Exists

The paper’s context-efficiency analysis follows NestBrowse-30B-A3B across tool-call turns on a BrowseComp subset.

It compares two quantities:

  • the total amount of page information processed by the complete nested system;
  • the much smaller context retained in the outer reasoning loop.

After approximately 20 tool calls, the cumulative information processed already exceeds the model’s 128,000-token context limit. At that point, roughly 85% of the trajectories remain active.

Without filtering, those investigations would be unable to continue under the same context constraint. With NestBrowse, the outer-loop context remains below the limit even as the combined system processes far more information over later turns.[4]

This result should not be misread as evidence that “85% of tasks fail from context exhaustion.” The figure does not run a separate unfiltered system and count its observed failures. It shows that, if every page had been passed into the context unfiltered, the limit would have been crossed while most investigations were still unfinished.

The experiment supports the mechanism. It does not directly measure production latency, token billing, or completion-rate improvement under a commercial deployment.

Even so, it captures a genuine operational problem. Context is not merely a storage allowance. Larger contexts increase inference cost and can make the model reason over more irrelevant material. A system that processes 700,000 tokens of source material does not necessarily need 700,000 tokens in its active decision state.
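A back-of-envelope simulation makes the mechanism concrete. The 128,000-token limit and the 100-call cap come from the paper's setup; the two per-call averages are invented for illustration, chosen only so that the raw total crosses the limit near the twentieth call, and the sketch ignores the outer loop's own reasoning tokens.

```python
CONTEXT_LIMIT = 128_000   # from the paper's setup
MAX_TOOL_CALLS = 100      # from the paper's setup
RAW_PER_CALL = 6_500      # ASSUMED average raw page size, in tokens
EXTRACT_PER_CALL = 400    # ASSUMED average size of what the inner loop returns


def first_overflow(tokens_per_call: int, limit: int = CONTEXT_LIMIT,
                   max_calls: int = MAX_TOOL_CALLS):
    """Return the first tool call at which cumulative tokens exceed the limit, else None."""
    total = 0
    for call in range(1, max_calls + 1):
        total += tokens_per_call
        if total > limit:
            return call
    return None


raw_overflow = first_overflow(RAW_PER_CALL)           # everything appended to one context
filtered_overflow = first_overflow(EXTRACT_PER_CALL)  # only extracts reach the outer loop
```

With these made-up averages, the raw pathway overflows at call 20 and the filtered pathway never overflows within the 100-call cap. Real pages vary enormously in size, so the specific crossing point means nothing; the difference in growth rate is what matters.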

The nested architecture functions like a research organisation in miniature. Page-level readers inspect documents and send relevant evidence to the lead analyst. The lead analyst does not sit beside the photocopier reading every page in the order it arrives.

The Inner Reader Can Become the System’s Bottleneck

A second analysis keeps NestBrowse-30B-A3B as the outer-loop model while swapping the model used inside the page-exploration loop.

On a 100-example BrowseComp subset, performance changes as follows:

| Outer-loop model | Inner-loop model | BrowseComp |
| --- | --- | --- |
| NestBrowse-30B-A3B | NestBrowse-4B | 24.0 |
| NestBrowse-30B-A3B | NestBrowse-30B-A3B | 35.0 |
| NestBrowse-30B-A3B | GPT-OSS-120B | 36.0 |

This is a component-sensitivity test rather than a headline benchmark.

Its message is straightforward: a strong task-level reasoner cannot recover evidence that its page-level reader failed to retain.

Using the smaller 4B model inside the inner loop reduces performance sharply. Replacing the inner model with the much larger GPT-OSS-120B improves the score by only one point over NestBrowse-30B-A3B in this particular test.

That pattern suggests two practical conclusions.

First, the inner loop is not a trivial summarisation utility that can automatically be assigned to the cheapest available model. It performs a consequential evidence-selection task.

Second, larger is not automatically economical. Once page-reading quality reaches a sufficient level, additional model scale may produce diminishing gains. The correct inner-loop model therefore has to be selected through task-specific evaluation, not by assigning it the smallest deployment budget and hoping for the best.

The paper also evaluates inner-loop outputs for raw snapshot retention and goal-relevant extraction using GPT-4.1 as judge. NestBrowse models improve over their respective base models on both dimensions across the tested benchmarks.

That supports the claim that joint training improves page exploration. Because these metrics depend on an LLM judge rather than independently labelled ground truth, they should be treated as supporting analysis, not a definitive measurement of extraction fidelity.

The Calculator Case Shows the Browser Becoming a Meta-Tool

One case study illustrates what click and fill add beyond static information retrieval.

Faced with a GAIA question involving Newton’s method, NestBrowse searches for an online calculator, visits it, fills in the function and initial value, clicks the calculation button, and uses the returned result.

The example is modest, but its implication is useful.

A browser is not only a document reader. It is also an interface to calculators, converters, databases, simulators, search forms, and other specialised utilities. By operating those interfaces, an agent can use tools that were never explicitly integrated into its original function library.

The paper describes this as meta tool-use: browser access gives the agent potential access to the tools embedded throughout the web.

This case is exploratory evidence, not proof of reliable general browser automation. Successfully operating one public calculator does not establish that the system can safely complete purchases, submit regulated filings, navigate authenticated enterprise applications, or recover from unpredictable interface changes.

Still, it demonstrates why search and visit alone are incomplete. Sometimes the required information does not exist until a webpage performs an action.

What Businesses Can Infer From the Paper

The paper directly demonstrates improved performance on deep information-seeking benchmarks, stronger results from combining tool simplification with goal-conditioned extraction, and better context management through nested page exploration.

The following business implications are inferences from that evidence rather than outcomes directly tested by the authors.

Separate research memory from page-reading memory

Many agent systems allow search results, retrieved documents, tool outputs, and intermediate reasoning to accumulate in one conversation history. NestBrowse suggests a cleaner boundary.

The task-level agent should retain:

  • the current objective;
  • working hypotheses;
  • verified evidence;
  • unresolved questions;
  • decisions about the next action.

Raw webpages and temporary extraction state should live elsewhere.

This pattern is applicable to competitive intelligence, policy monitoring, compliance research, technical support, and due-diligence workflows. In each case, the value comes from keeping the central investigation coherent while allowing specialised components to inspect large sources.

Make every page transition carry an explicit goal

NestBrowse requires visit and click actions to include a goal because the inner loop needs to know what to extract.

That design can improve observability in business systems. A page visit is no longer logged merely as “opened URL.” It can be recorded as “opened URL to verify the effective date of the regulation” or “opened product page to identify stated warranty exclusions.”

Explicit goals make extraction behaviour easier to test and audit.

They also expose failure more clearly. When relevant evidence is missed, reviewers can distinguish between a poor page reader and a poorly specified goal.

Preserve raw evidence outside the reasoning context

NestBrowse filters information before it reaches the outer loop. In a commercial or regulated setting, discarded material may still need to remain accessible for audit, reprocessing, or changed investigative goals.

A practical implementation could maintain three layers:

| Storage layer | Contents | Purpose |
| --- | --- | --- |
| Raw evidence archive | Original snapshots, timestamps, URLs, interaction states | Audit and replay |
| Page-level workspace | Goal-conditioned extracts and summaries | Evidence preparation |
| Reasoning context | Selected evidence and task decisions | Efficient investigation |

This preserves the context-efficiency benefit without treating filtered-out content as permanently irrelevant.
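A minimal in-memory sketch of those three layers, assuming a content-derived identifier links each extract back to its raw snapshot. The class and method names are hypothetical, and a real deployment would replace the dictionaries with durable storage.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class EvidenceStore:
    raw_archive: dict = field(default_factory=dict)        # snapshot_id -> full record
    page_workspace: dict = field(default_factory=dict)     # snapshot_id -> list of extracts
    reasoning_context: list = field(default_factory=list)  # what the outer loop sees

    def archive(self, url: str, snapshot: str, fetched_at: str) -> str:
        """Store the raw snapshot and return a content-derived identifier."""
        digest = hashlib.sha256(f"{url}\n{snapshot}".encode("utf-8")).hexdigest()
        snapshot_id = digest[:16]
        self.raw_archive[snapshot_id] = {
            "url": url, "snapshot": snapshot, "fetched_at": fetched_at,
        }
        return snapshot_id

    def record_extract(self, snapshot_id: str, goal: str, extract: str) -> None:
        """Keep the extract at page level and pass a compact, sourced entry upward."""
        self.page_workspace.setdefault(snapshot_id, []).append(
            {"goal": goal, "extract": extract}
        )
        self.reasoning_context.append(
            {"source": snapshot_id, "goal": goal, "evidence": extract}
        )

    def replay(self, snapshot_id: str) -> str:
        """Recover the original page, for audit or re-extraction under a new goal."""
        return self.raw_archive[snapshot_id]["snapshot"]
```

The reasoning context holds only the extract and a pointer, so the full page never enters it. If the investigation's goal changes, `replay` returns the untouched snapshot for a second pass.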

Evaluate tool responses as carefully as tool selection

Agent-development discussions often focus on whether the model chose the correct function. The ablation indicates that what the function returns can matter even more.

A technically correct visit action can still damage the investigation if it returns an enormous page dump. A concise result can also be harmful if it omits the decisive evidence.

Tool evaluation should therefore include response quality:

  • Did the tool preserve the required evidence?
  • Did it include enough context to interpret that evidence?
  • Did it avoid flooding the task-level agent?
  • Can the raw source be recovered when needed?

The most elegant tool schema in the meeting room remains useless if every call returns the textual equivalent of a storage unit emptied onto the floor.
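Three of the four questions above can be turned into mechanical checks. The sketch below is one hypothetical way to do so: the function name, the default budget, and the whitespace-based token estimate are all assumptions. The second question, whether enough interpretive context survives, is left out on purpose because it needs human or model judgement.

```python
def evaluate_response(response: str, required_evidence: list, source_id,
                      token_budget: int = 1_000) -> dict:
    """Score one tool response against the mechanically checkable questions."""
    text = response.lower()
    approx_tokens = len(response.split())  # crude whitespace count, not a real tokenizer
    missing = [e for e in required_evidence if e.lower() not in text]
    return {
        "evidence_preserved": not missing,           # required facts are present
        "missing": missing,                          # which facts were dropped
        "within_budget": approx_tokens <= token_budget,
        "source_recoverable": source_id is not None,  # raw page can be fetched again
    }


report = evaluate_response(
    response="Warranty excludes water damage. Coverage lasts 24 months.",
    required_evidence=["water damage", "24 months", "battery"],
    source_id="snap-001",
)
```

In this invented case the response is short and traceable but omits the battery exclusion, so `evidence_preserved` is false and `missing` names the gap. A concise answer has still failed the investigation.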

What the Paper Does Not Yet Establish

NestBrowse is evaluated as a text-only information-seeking system. Its browser backend exposes textual page content and semantic DOM identifiers, but it does not use visual information.

This excludes important real-world cases in which meaning is carried by charts, scanned documents, layout, images, maps, video, or visual state changes. Adding multimodal perception would increase both capability and complexity; the paper intentionally leaves that problem open.

The evaluation also focuses on web-based question answering. It does not test authenticated corporate systems, transactional reliability, access controls, browser security, prompt injection, or the consequences of clicking the wrong element.

Other practical uncertainties remain:

  • The paper reports accuracy but not end-to-end latency.
  • It does not report inference cost or token-cost savings.
  • It does not measure extraction recall against human-labelled evidence.
  • Final answers and inner-loop quality analyses rely on GPT-4.1 judging.
  • The training process requires substantial GPU time despite the relatively small model sizes.
  • The calculator case demonstrates possibility, not dependable execution across arbitrary web applications.

These boundaries do not weaken the paper’s central result. They define where it applies.

NestBrowse provides evidence that carefully designed browser abstraction and information flow can improve deep web research. It does not yet provide a production blueprint for every activity performed inside a browser.

Better Browsing Begins With Refusing to Read Everything

The common intuition is that a more capable browser agent needs more: more tools, more page access, more context, and usually a larger model to manage the resulting complexity.

NestBrowse offers a more disciplined replacement.

Give the agent enough actions to reach the information. Separate page exploration from task reasoning. Filter evidence before it enters working memory. Train navigation and reading as related but distinct behaviours.

The paper’s strongest contribution is not proving that small models are universally sufficient. It is showing that model capacity can be wasted by poor information plumbing.

A browser agent does not become intelligent by copying more of the internet into its prompt.

Sometimes it improves by learning what not to bring back.

Cognaptus: Automate the Present, Incubate the Future.


  1. Baixuan Li et al., “Nested Browser-Use Learning for Agentic Information Seeking,” arXiv:2512.23647, 2025, https://arxiv.org/abs/2512.23647

  2. Ibid., Section 4.1 and Table 1.

  3. Ibid., Section 4.3 and Table 2.

  4. Ibid., Sections 4.4–4.6, Figures 3–5, and Table 3.