Search is now where many AI demos go to become either useful products or expensive browser cosplay.
A model that answers from memory can look impressive for five minutes. A model that can search, compare, verify, follow clues, abandon bad paths, and synthesize a final answer is much harder to fake. That is why “deep research” has become one of the more important capability battles in AI. It is also why the battle has been awkwardly closed. Many labs release weights, leaderboards, and cinematic launch posts. Far fewer release the thing that actually teaches the agent how to search: the training data.
A March 2026 paper from Shanghai Jiao Tong University, OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data, attacks exactly that bottleneck.1 The authors introduce OpenSeeker, a deep-search agent trained with simple supervised fine-tuning on 11.7k synthesized samples. The headline is easy to remember: OpenSeeker releases not only the model, but also the training data and the synthesis pipeline. The deeper point is more useful: the paper is less about “one more agent” and more about how to manufacture search experience.
That distinction matters. A normal reading of the paper would jump straight to the benchmark table: OpenSeeker scores 29.5 on BrowseComp, 48.4 on BrowseComp-ZH, 74.0 on xbench-DeepSearch, and 59.4 item F1 on WideSearch. Those results are impressive for a 30B-class SFT-only model. But the benchmark table is not the engine. It is the receipt.
The engine is a two-part data recipe:
- first, construct hard questions from the topology of the web itself;
- then, generate cleaner expert trajectories while forcing the final student model to train on raw noisy observations.
That sounds like a small technical detail. It is not. It is the difference between training an agent on “answers” and training it on the kind of frustrating, clue-following work that real search requires. In business terms, OpenSeeker suggests that the defensible asset in agentic search may not be the largest model or the most elaborate reinforcement-learning pipeline. It may be the ability to generate difficult, verifiable, domain-specific search tasks at scale.
Naturally, this is bad news for anyone whose AI strategy is “we will add an agent button.” Those buttons are multiplying. Competence is not.
The monopoly is not the search box; it is the training experience
The paper begins from a practical observation: deep search agents have advanced quickly, but high-performance training remains concentrated among companies with closed data pipelines. The public often sees the final interface. Researchers may see open weights. Builders usually do not see the data that shaped the agent’s search behavior.
That creates a strange form of openness. A model can be downloadable but still not reproducible. The weights are visible; the recipe is missing. It is like receiving a beautiful cake and being told this is a major step toward open-source baking.
OpenSeeker’s contribution is therefore not merely “another open model.” The paper positions itself against a data moat. The authors argue that frontier search agents require high-quality long-horizon trajectories, and that existing open efforts often have one of three problems: they lack full data release, provide only partial data, or fail to reach competitive performance. OpenSeeker tries to close this gap by releasing the model weights, the 11.7k-sample training dataset, and the synthesis solution.
The misconception worth removing is simple: frontier search agents do not necessarily require only massive proprietary data, RL-heavy training, or industrial-scale closed infrastructure. They may require carefully designed experience. The word “carefully” is doing real work here. Random synthetic data is cheap. Useful synthetic data is engineered.
The paper’s mechanism-first story can be summarized as follows:
| Problem in search-agent training | OpenSeeker’s mechanism | Why it matters operationally |
|---|---|---|
| Easy questions can be solved by memory or keyword lookup | Generate questions from web graph structures and obfuscate entities | Forces multi-hop search rather than shallow retrieval |
| Synthetic questions can be invalid or hallucinated | Reject samples that are either too easy or not solvable with oracle context | Keeps difficulty without breaking correctness |
| Raw web observations are noisy and long | Use denoised context for teacher trajectory generation | Produces cleaner expert actions |
| Real inference still faces raw web pages | Train the student on raw observations while imitating teacher actions | Teaches the model to extract signal from noise |
This is the core. OpenSeeker is not betting that the base model magically becomes a better searcher. It is betting that search ability can be induced by constructing the right kind of training struggle.
Step one: build questions from the web graph, not from vibes
The first technical move is fact-grounded scalable controllable QA synthesis. The phrase is long, but the idea is clean.
Instead of asking an LLM to invent questions in the abstract, OpenSeeker starts from the web as a directed graph. Pages are nodes. Hyperlinks are edges. A seed page is expanded into a local subgraph of connected pages. From that subgraph, the pipeline extracts entities and relationships, forming a condensed entity subgraph. Then it generates a question whose answer requires traversing multiple links in that structure.
The important inversion is this:
OpenSeeker does not begin with a question and then look for evidence. It begins with evidence paths and then constructs questions that require those paths.
That inversion solves two common problems in synthetic training data.
The first problem is factual drift. LLM-generated questions can sound plausible while being weakly grounded. By anchoring question construction in real web topology, OpenSeeker reduces the chance that the task is merely a hallucinated puzzle wearing a research costume.
The second problem is shallow solvability. Many generated questions contain the exact entity names needed for search. If the model can paste a keyword into a search tool and win, the training sample does not teach deep search. It teaches copy-paste with extra tokens.
OpenSeeker handles this by obfuscating entities. Concrete names are converted into vague descriptions while preserving the underlying reasoning path. In other words, the system removes the shortcut while keeping the answer stable. The agent must identify what the vague clue refers to, follow related pages, and combine multiple pieces of evidence.
For business readers, this is the first useful lesson. If you want to train a procurement agent, a compliance agent, or an investment research agent, the valuable training tasks are probably not generic Q&A pairs. They are structured search problems built from the actual relationship graph of your domain: suppliers, contracts, regulations, filings, products, competitors, alerts, and exceptions.
A domain agent does not become competent because it has read a pile of documents. It becomes competent when it has practiced solving the kinds of ambiguous, multi-hop problems that the domain keeps producing. This is less glamorous than “agentic AI.” It is also more likely to work.
Verification is where synthetic data stops being wishful thinking
OpenSeeker’s question generation would be risky if it stopped at generation. It does not. The paper adds dual-criteria verification through rejection sampling.
The two filters are simple and important:
| Verification criterion | Test | Rejected when | Purpose |
|---|---|---|---|
| Difficulty | A strong model answers without tools | The closed-book answer is correct | Remove questions that do not require search |
| Solvability | A model answers with full entity-subgraph context | The oracle-context answer is wrong | Remove broken or inconsistent questions |
This is an elegant design because it separates two properties that are often confused. A good training question should be hard, but not impossible. If it is answerable from memory, it does not train search. If it is not answerable even with the relevant context, it trains confusion.
The business implication is also blunt. Synthetic data is not valuable because it is synthetic. It is valuable when the generation process is paired with tests that reject useless samples. Many enterprise AI projects fail quietly here. They create large internal instruction datasets, but the tasks are either too easy, too vague, or not checkable. Then the model is fine-tuned, disappoints everyone, and someone blames the base model. Sometimes the base model deserves blame. Often the dataset was just a motivational poster with JSON formatting.
OpenSeeker’s verification design gives a better pattern: generate from structure, then filter by behavioral tests. The question should defeat memory-only answering while remaining solvable with the right evidence. That is a practical standard a business team can adapt.
Step two: let the teacher see clean context, then make the student survive raw context
The paper’s second mechanism is denoised trajectory synthesis. This is where OpenSeeker becomes more interesting than a graph-based Q&A generator.
Search-agent training needs trajectories: sequences of reasoning steps, tool calls, observations, and final answers. The problem is that raw web observations are messy. They contain irrelevant text, repeated navigation material, partial clues, distractors, and enough noise to make even a patient analyst briefly consider farming.
If a teacher model generates trajectories while seeing the entire raw history, the context becomes long and polluted. If the system compresses too aggressively, it may remove useful evidence. OpenSeeker handles this with a “summarized history + raw recent” protocol during trajectory synthesis.
The decision phase gives the teacher full access to the most recent raw observation, so fresh details are not prematurely lost. Then, after that step, a summarizer compresses the previous observation into a cleaner summary for long-term history. This rolling denoising lets the teacher generate higher-quality actions over long horizons.
So far, this is helpful but not unusual. Many systems summarize context.
The clever part is the asymmetry.
During synthesis, the teacher benefits from denoised context. During training, the student receives raw, uncompressed tool responses and is supervised to imitate the teacher’s expert actions. The student therefore learns to make clean decisions from dirty evidence.
That is a useful kind of unfairness. The teacher gets a tidy office. The student gets the inbox.
This design matters because real deployment is closer to the student setting. A production search agent will not always receive pristine summaries. It will see broken pages, irrelevant excerpts, duplicate snippets, malformed tables, SEO debris, and documents written by people who believe headings are optional. Training on raw observations while imitating actions derived from cleaner teacher context pushes the model to internalize denoising rather than depend on a separate summarizer at every moment.
For enterprise teams, the lesson is not “always copy OpenSeeker’s exact pipeline.” The lesson is to distinguish generation convenience from deployment reality. You may use cleaner scaffolds to generate expert demonstrations, but the model should train against the messy input distribution it will actually face. Otherwise the agent learns a beautiful workflow that collapses the moment a PDF has footnotes, tables, and three versions of the same policy.
Why 11.7k samples can matter more than 147k samples
The paper trains OpenSeeker-v1-30B-SFT from Qwen3-30B-A3B-Thinking-2507. The model has 30B total parameters with 3B activated during prediction. The authors use a context window of 256k and set a maximum of 200 tool calls. Each training sample contains the user question plus raw reasoning steps, tool calls, and full uncompressed tool responses. Because of resource constraints, they report a single training run without heuristic data filtering or hyperparameter tuning.
This setup is important because the paper is not claiming victory through massive iteration. The authors are effectively saying: with one SFT run and 11.7k high-fidelity samples, the agent can become competitive with much more resource-intensive systems.
That claim is supported by several result categories. They are not all the same kind of evidence.
| Evidence in the paper | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Table 1: broad comparison across proprietary, large open-source, and 30B models | Main evidence and comparison with prior work | OpenSeeker is unusually competitive for a fully open, SFT-only 30B-class agent | It does not show OpenSeeker beats all frontier closed systems |
| Table 2: SFT-only comparison among 30B-class agents | Main evidence isolating training setup | Under simple SFT, OpenSeeker performs best among listed comparable models across the reported benchmarks | It does not isolate every possible architecture or base-model difference |
| Table 3: comparable data-volume comparison against WebSailor/WebLeaper configurations | Comparison with prior work and data-efficiency evidence | Similar or smaller data volume can produce stronger results if synthesis quality is higher | It does not prove the exact contribution of each OpenSeeker data component individually |
| Figures 4–5: tool-call and token-count difficulty analysis against BrowseComp variants | Dataset difficulty analysis | OpenSeeker’s synthesized tasks are long-horizon and difficult, especially in Chinese | Difficulty metrics alone do not equal usefulness |
| Appendix A comparison with OpenResearcher and REDSearcher | Concurrent-work comparison | OpenSeeker is positioned as more open and more data-efficient than selected concurrent efforts | It does not settle all future open-agent comparisons |
The numbers deserve careful reading.
In the broad comparison, OpenSeeker scores 29.5 on BrowseComp, 48.4 on BrowseComp-ZH, 74.0 on xbench-DeepSearch, and 59.4 on WideSearch item F1. It outperforms DeepDive-32B on BrowseComp, BrowseComp-ZH, and xbench. It also exceeds Tongyi DeepResearch on BrowseComp-ZH, 48.4 versus 46.7, despite Tongyi using a heavier CPT + SFT + RL pipeline. Among the SFT-only 30B-class models listed in Table 2, OpenSeeker is the strongest across the reported benchmarks.
But the paper is not a universal “OpenSeeker beats everything” story. On BrowseComp, OpenSeeker’s 29.5 is below WebSailor-V2-30B with SFT + RL at 35.3, below Tongyi DeepResearch at 43.4, and below several closed or much larger open models. On xbench, Tongyi DeepResearch reports 75.0, slightly above OpenSeeker’s 74.0. GPT-5-High and OpenAI-o3 also remain far ahead on some English search benchmarks in the broad table.
This boundary actually makes the result more credible, not less. OpenSeeker is not demolishing frontier closed systems across the board. It is showing that a transparent, academic, SFT-only pipeline can reach a surprisingly high tier with far less disclosed training machinery. The right interpretation is data efficiency, not total market conquest. We can safely leave “total market conquest” to pitch decks and other minor works of fiction.
The Chinese result is not just a leaderboard ornament
The BrowseComp-ZH result deserves extra attention because it is easy to treat it as one row in a table. OpenSeeker scores 48.4, above Tongyi DeepResearch’s 46.7 and far above several other 30B-class SFT baselines. The paper connects this to the difficulty of its Chinese data.
The authors report that the Chinese OpenSeeker dataset contains only about 1.4k samples, but those samples are long and difficult. Their analysis shows OpenSeeker-v1-Data-ZH averaging 46.35 tool calls per trajectory and 76.1k tokens, compared with BrowseComp-ZH’s 26.98 tool calls and 15.1k tokens under the same inference model.
This is not merely “more tokens, therefore better.” Longer tasks can be useless if they are noisy, repetitive, or badly grounded. The interpretation is narrower: OpenSeeker’s Chinese dataset appears to create search tasks with deeper tool-use requirements, and those requirements plausibly train the agent to persist through longer evidence chains.
That matters for non-English AI strategy. Many companies assume localization is mostly a translation layer: take an English workflow, translate prompts, add regional documents, and hope the system behaves. OpenSeeker suggests a stronger approach. Localized agents may need localized task synthesis. The search paths, entity ambiguity, document structures, and web topology differ by language and region. Training data should reflect that.
For a business building agents in finance, procurement, law, healthcare operations, or government services, this is not a small point. A multilingual agent trained on generic English reasoning may speak the language while failing the workflow. Fluency is cheap. Local search competence is not.
The paper’s real business lesson is data curriculum design
OpenSeeker can be read as a technical paper about search agents. For builders, it is also a paper about curriculum design.
The authors explicitly emphasize controllability. By changing the subgraph size, entity obfuscation, and graph configuration, the pipeline can alter task complexity and information coverage. That makes the training data less like a static dataset and more like an adjustable problem generator.
This is where the paper becomes more generally useful.
A company building a domain agent should ask three questions before fine-tuning anything:
| Design question | OpenSeeker-inspired version | Enterprise example |
|---|---|---|
| What makes a task hard? | Is difficulty caused by multi-hop evidence, ambiguity, noise, or time horizon? | A compliance agent must connect policy clauses, transaction metadata, and regulator guidance |
| What makes a task valid? | Can the answer be verified if the right evidence is provided? | A procurement agent’s supplier-risk conclusion must trace back to filings, sanctions lists, and shipment data |
| What should the agent practice? | Which search paths resemble real analyst work? | An investment research agent must move from event signal to filings, competitor disclosures, and price-sensitive context |
This is a more serious framing than “fine-tune on internal documents.” Fine-tuning on documents teaches style and local knowledge. Fine-tuning on structured, verified, long-horizon search tasks teaches behavior.
The distinction is not academic hair-splitting. Many AI systems fail not because they lack access to data, but because they were never trained on the work pattern. They can quote a policy but cannot investigate an exception. They can summarize a contract but cannot trace a risk across three annexes and an email thread. They can answer a product FAQ but cannot resolve a customer issue that requires checking order history, warranty terms, and a regional service rule.
OpenSeeker’s pipeline points toward a different development pattern: build a synthetic task factory around the actual work graph, filter tasks for difficulty and solvability, generate expert trajectories with scaffolding, and train the deployable agent against raw operational context.
That is not cheaper than prompt engineering. It is more demanding. But it is also closer to how durable capability is built.
What the evidence says, and what Cognaptus infers
It is useful to keep the layers separate.
The paper directly shows that OpenSeeker, trained with simple SFT on 11.7k synthesized samples, reaches strong performance across four search benchmarks. It shows especially strong relative results among SFT-only 30B-class agents and a notable BrowseComp-ZH result against a heavier industrial baseline. It also shows that the released dataset is difficult in terms of tool-call length and token length, especially for the Chinese subset.
Cognaptus infers that the strategic asset is not merely the final model, but the repeatable data synthesis process. If a small team can build high-value search behavior by generating hard, verifiable, graph-grounded tasks, then enterprise AI teams should care less about generic agent wrappers and more about domain-specific task construction.
What remains uncertain is equally important. The paper does not fully ablate every component of the pipeline. It does not prove that graph-grounded QA synthesis alone contributes a specific percentage, or that denoised trajectory synthesis alone contributes another specific percentage. It does not show broad enterprise-tool integration beyond web search. It also reports a single training run, leaving open the question of how robust the result is across hyperparameters, base models, filtering strategies, and larger data scales.
Those uncertainties do not weaken the core lesson. They define its boundary. OpenSeeker is not a magic recipe for all agents. It is strong evidence that data construction quality can compensate for a surprising amount of training complexity in deep-search settings.
The limitations are practical, not decorative
Several limitations should affect how builders use this paper.
First, the evaluation is benchmark-based. Benchmarks like BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch are useful because they stress multi-step search. But business environments contain permissions, stale internal records, conflicting data owners, compliance constraints, latency budgets, and users who write “please check this” while attaching the wrong file. Benchmark strength is a necessary signal, not an enterprise-readiness certificate.
Second, the current OpenSeeker pipeline is focused on web search. The authors explicitly mention future work extending beyond pure web search to more diverse tools and data sources. That extension is not automatic. Searching the open web is different from navigating CRM systems, ERP workflows, internal knowledge bases, code repositories, contract databases, and messaging archives.
Third, the paper leaves room for optimization. The authors state that resource constraints limited them to a single run without heuristic filtering or hyperparameter tuning. That is promising because performance may improve. It is also a warning because the reported result is not yet a systematic scaling study.
Fourth, openness is necessary but not sufficient. Releasing data and weights makes research reproducible and accelerates community progress. It does not remove the need for governance, data quality control, privacy handling, and domain-specific evaluation when similar methods are applied inside companies.
The practical conclusion is therefore not “download OpenSeeker and declare victory.” The practical conclusion is “study the data factory.” The model is useful. The recipe is the more transferable asset.
The uncomfortable future: agents as manufactured experience
OpenSeeker’s most useful contribution is not that it makes search agents open. It is that it makes their training logic inspectable.
The paper shows a plausible route from web structure to difficult questions, from difficult questions to expert trajectories, and from clean teacher decisions to raw-context student learning. That route explains why 11.7k samples can matter. The samples are not just examples. They are carefully shaped experiences.
For AI product teams, this changes the build question.
The weaker question is:
Which model should we use for our agent?
The stronger question is:
What experience should the agent repeatedly practice until the behavior becomes reliable?
Model choice still matters. Compute still matters. RL may still matter. But OpenSeeker is a reminder that capability often comes from the problem environment. If the agent practices shallow tasks, it becomes shallow with confidence. If it practices ambiguous, grounded, verifiable, long-horizon tasks, it has a better chance of becoming useful.
This is not the end of the search monopoly. That would be too neat, and reality dislikes clean endings. But it is a meaningful crack in the data moat. OpenSeeker makes the quiet part visible: the future of search agents will be shaped less by who can name the largest model and more by who can manufacture the best training experience.
That is less cinematic than a frontier-model launch. It is also harder to copy.
Cognaptus: Automate the Present, Incubate the Future.
-
Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, and Siheng Chen, “OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data,” arXiv:2603.15594, 2026. https://arxiv.org/abs/2603.15594 ↩︎