A spreadsheet is a cruel test of artificial intelligence.
Not the toy spreadsheet used in demos, with six rows, three columns, and a suspiciously cooperative universe. I mean the kind of table a real analyst asks for: every qualifying supplier in a region, every product SKU released over a decade, every regulatory filing matching a narrow condition, every competitor with exact addresses, dates, sources, and no missing cells because apparently human suffering needs columns.
This is where many “deep research” agents become less deep and more theatrical. They reason beautifully for a while, gather a few promising sources, then quietly start dropping rows, merging entities, repeating claims, or declaring that a representative sample is good enough. Representative samples are charming. They are also not what the user asked for.
InfoSeeker, introduced in “InfoSeeker: A Scalable Hierarchical Parallel Agent Framework for Web Information Seeking,” is interesting because it treats this failure not as a shortage of intelligence, but as a shortage of structure.[1] The paper’s core argument is simple enough to be dangerous: wide information work should not be forced through one long, shared reasoning thread. It should be split into semi-independent modules, executed in parallel, verified locally, and compressed upward.
In other words, the agent should stop pretending to be a lone genius. It should behave more like a disciplined research operation.
The mistake is treating wide search as deeper reasoning
The AI industry has spent several years rewarding depth. Longer chain-of-thought-style reasoning. More reflection. More browsing rounds. More tool calls. More context. More everything, because “more” is the easiest product roadmap to explain at a conference.
That emphasis is not wrong. Sequential reasoning matters when a task depends on a chain of dependent inferences: identify a clue, use it to find another clue, resolve ambiguity, then answer. But wide information seeking is a different animal. It is not primarily about following one fragile reasoning path. It is about discovering many relevant entities, validating attributes across heterogeneous sources, and assembling a complete structure without drowning the model in its own intermediate evidence.
The paper identifies three familiar failure modes in current agentic search systems:
| Failure mode | What happens in practice | Why it hurts business use |
|---|---|---|
| Context saturation | Raw evidence, pages, tables, snippets, and intermediate notes flood the shared context. | Important exceptions disappear. The model becomes confidently approximate. A lovely phrase, until it is your compliance table. |
| Cascading error propagation | Early mistakes become assumptions for later steps. | The agent builds a polished answer on a wrong entity list or a misread source. |
| Latency explosion | Sequential tool use turns a broad task into a long wait. | Research agents become too slow for routine workflows, even when they are technically capable. |
A larger context window delays saturation. It does not remove the structural problem. A stronger model may reason better inside the same bottleneck, but it still has to carry too much unrelated detail through too many steps. The misconception InfoSeeker pushes against is therefore useful: deep-research failure is not always caused by a model failing to “think hard enough.” Sometimes the system is asking one mind to do the work of a small department while also storing the filing cabinet in its head.
The paper’s answer is near-decomposability: separate the system into modules that work locally on details and communicate upward through summaries. That phrase sounds academic because it is. Its operational meaning is more practical: keep the messy work where it belongs.
InfoSeeker’s mechanism: one Host, several Managers, many Workers
InfoSeeker uses a three-layer architecture: Host, Managers, and Workers. The design is not merely “multi-agent” in the decorative sense where several agents are given names and encouraged to chat until the token bill develops ambition. Each layer has a different information boundary.
| Layer | Main job | What it sees | What it should not see |
|---|---|---|---|
| Host | Maintains the global plan and decides the next high-level step. | The user query and compressed step-level results from Managers. | Raw tool traces, page dumps, worker-level clutter. |
| Manager | Decomposes a Host step into domain-specific subtasks, validates results, and aggregates them. | The assigned step, worker outputs, domain-local evidence. | The full global history unless needed. |
| Worker | Executes atomic tool interactions such as search, browsing, file processing, or code execution. | Its assigned subtask and tool outputs. | The whole project context. |
This is controlled ignorance. The Host is powerful because it is protected from details. Workers are useful because they can focus on narrow jobs without being asked to understand the whole mission. Managers provide the membrane: they decompose, verify, revise, and summarize.
The paper formalizes this with a Host context made of step-response pairs. The Host generates a step, routes it to a Manager, receives a summarized result, and updates its plan. Managers decompose their assigned step into parallel subtasks, dispatch those subtasks to Workers, reflect on whether the results are acceptable, revise if necessary, and then aggregate. Workers execute tool calls locally and return only the subtask result.
That layered boundary is the mechanism. Without it, parallelism easily becomes chaos: many branches, many snippets, many partial claims, and one final model forced to reconcile everything under pressure. With it, the system gets two advantages at once: bounded context and wider execution.
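The information boundaries described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `Host`, `manager`, `worker`, and `run` are hypothetical, the Worker is a stub rather than a real tool call, and the plan is fixed in advance, whereas the paper's Host generates each step from its step-response history.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Host:
    """Holds only (step, summary) pairs, never raw tool output."""
    query: str
    history: list[tuple[str, str]] = field(default_factory=list)


async def worker(subtask: str) -> str:
    # Stand-in for a real tool call (search, browsing, file or code tools).
    await asyncio.sleep(0)
    return f"raw evidence for {subtask}"


async def manager(step: str, subtasks: list[str]) -> str:
    # Fan out to Workers in parallel; raw evidence stays inside this scope.
    raw = await asyncio.gather(*(worker(s) for s in subtasks))
    # Local check before anything moves upward (a hypothetical, trivial rule).
    verified = [r for r in raw if r.startswith("raw evidence")]
    # Compress: the Host receives a short summary, not the evidence itself.
    return f"{step}: {len(verified)}/{len(subtasks)} subtasks verified"


async def run(host: Host, plan: dict[str, list[str]]) -> Host:
    # Simplification: a fixed plan instead of a Host that plans step by step.
    for step, subtasks in plan.items():
        summary = await manager(step, subtasks)
        host.history.append((step, summary))
    return host


plan = {
    "find entities": ["official list", "secondary list"],
    "fill attributes": ["entity A", "entity B", "entity C"],
}
host = asyncio.run(run(Host("example query"), plan))
```

After the run, `host.history` contains two short summaries and no raw evidence strings, which is the whole point of the membrane.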
The simplest way to see the latency argument is this:
$$T_{\text{sequential}}=\sum_{i=1}^{n}T_i$$
If a system performs subtasks one by one, the wall-clock time accumulates. But when weakly coupled subtasks can run in parallel, the wall-clock time moves closer to:
$$T_{\text{parallel}}=\max_i T_i$$
Coordination overhead still exists. Workers are not magic elves, sadly. But the economic direction changes: the system waits for the slowest relevant branch rather than every branch in sequence. For wide search, where many subtasks are independent after decomposition, that distinction is not cosmetic. It is the business model.
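A small scheduling sketch makes the two extremes, and the middle ground between them, concrete. The `makespan` function and the subtask durations are hypothetical. The sketch assumes subtasks are fully independent and handed in order to whichever worker frees up first, and it ignores coordination overhead entirely.

```python
import heapq


def makespan(durations: list[float], workers: int) -> float:
    """Wall-clock time when tasks go, in order, to the next free worker."""
    if workers < 1:
        raise ValueError("need at least one worker")
    free_at = [0.0] * workers  # min-heap of the times each worker becomes free
    for d in durations:
        start = heapq.heappop(free_at)      # earliest available worker
        heapq.heappush(free_at, start + d)  # busy until start + d
    return max(free_at)


durations = [4.0, 9.0, 3.0, 7.0, 5.0]  # hypothetical subtask times in seconds

sequential = makespan(durations, workers=1)                    # sum of durations
fully_parallel = makespan(durations, workers=len(durations))   # slowest branch
two_workers = makespan(durations, workers=2)                   # in between
```

With one worker the result is the sum (28 seconds), with one worker per subtask it is the maximum (9 seconds), and with two workers it lands at 14 seconds. Real deployments sit somewhere on that curve, plus overhead.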
The main evidence: InfoSeeker wins where width is the bottleneck
The paper evaluates InfoSeeker on two complementary benchmarks. WideSearch stresses broad structured information synthesis: agents must populate complete tables using many sources, with strict penalties for missing or spurious rows. BrowseComp-zh stresses Chinese web browsing and multi-hop evidence alignment in a native Chinese web environment.
These are not identical tests. That matters. WideSearch tests the wide-table problem most directly. BrowseComp-zh tests whether the architecture can still coordinate search, browsing, and reasoning in a linguistically and infrastructurally different web setting.
| Test or analysis | Likely purpose | Key reported result | What it supports | What it does not prove |
|---|---|---|---|---|
| WideSearch main benchmark | Main evidence for wide information synthesis. | InfoSeeker reports 8.38% Avg@4 success, 50.13% Avg@4 Row F1, and 70.27% Avg@4 Item F1 in the paper’s main comparison. | Hierarchical orchestration improves completeness and attribute accuracy on broad table-building tasks. | It does not mean exhaustive business databases are solved; success rates remain low under strict scoring. |
| BrowseComp-zh main benchmark | Main evidence for web navigation and cross-lingual robustness. | InfoSeeker reaches 52.9% accuracy, above OpenAI DeepResearch at 42.9% and BrowseMaster at 46.5%. | The same architecture helps when search must interact with native Chinese web pages and multi-hop clues. | It does not isolate language ability from tool quality, browsing setup, or benchmark-specific prompt design. |
| Identical-tool single-agent comparison | Architecture ablation. | On WideSearch-en, GPT-5.1 single-agent reaches 6.00% success and 35.74% Item F1; InfoSeeker reaches 12.50% success and 75.21% Item F1. | Gains are not just because InfoSeeker had better tools or a stronger model. Structure matters. | It does not prove every future backbone will show the same gap. |
| Worker-count latency analysis | Sensitivity/ablation on parallel execution width. | Latency falls from 911 seconds with one worker to 162 seconds with 17 workers, roughly a 5.6× speed-up on sampled WideSearch-en queries. | Wide subtasks can benefit strongly from worker parallelism. | It does not guarantee linear scaling forever; rate limits, overhead, and source bottlenecks still matter. |
| Case studies and failure cases | Implementation explanation and boundary diagnosis. | Michelin and Chinese historical-riddle examples show successful decomposition; disease and AMD CPU cases show entity-linking and context-volume failures. | The system’s behavior is interpretable enough to diagnose. | Case studies are not statistical proof. They explain mechanisms and failure modes. |
The WideSearch numbers deserve a careful reading. The headline success rate, 8.38% Avg@4, may look low if read like a normal QA benchmark. But WideSearch is strict: a task can require complete row discovery, correct attributes, and schema compliance across many sources. Under that evaluation, even human annotators are reported to achieve below 20% success on the English split. The stronger story is therefore not “InfoSeeker solves wide search.” It does not. The stronger story is that it improves the failure surface: Row F1 and Item F1 rise because the system is better at discovering entities and preserving attribute accuracy.
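To see why strict success can sit near zero while the F1 scores look respectable, here is a simplified scorer. It is an assumption-laden sketch, not the WideSearch evaluation code: the function names, the exact-match rule, and the toy tables are all hypothetical.

```python
def f1(tp: int, n_pred: int, n_gold: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def score(pred: dict[str, dict[str, str]], gold: dict[str, dict[str, str]]) -> dict:
    # Row level: a row counts only if its key and every cell match the gold row.
    row_tp = sum(1 for key, cells in pred.items() if gold.get(key) == cells)
    # Item level: each (row key, column, value) cell is scored on its own.
    pred_items = {(k, c, v) for k, cells in pred.items() for c, v in cells.items()}
    gold_items = {(k, c, v) for k, cells in gold.items() for c, v in cells.items()}
    item_tp = len(pred_items & gold_items)
    return {
        "success": pred == gold,  # all-or-nothing
        "row_f1": f1(row_tp, len(pred), len(gold)),
        "item_f1": f1(item_tp, len(pred_items), len(gold_items)),
    }


gold = {
    "A": {"city": "x", "year": "2001"},
    "B": {"city": "y", "year": "2002"},
    "C": {"city": "z", "year": "2003"},
}
pred = {
    "A": {"city": "x", "year": "2001"},  # fully correct
    "B": {"city": "y", "year": "1999"},  # one wrong cell
    "D": {"city": "w", "year": "2004"},  # spurious row; C is missing
}
result = score(pred, gold)
```

One wrong cell, one missing row, and one spurious row are enough to fail the task outright, yet Row F1 is one third and Item F1 is one half. Partial credit is where architectural improvements become visible.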
The BrowseComp-zh result adds another layer. InfoSeeker’s 52.9% accuracy beats both commercial deep-research and open agent baselines listed in the paper. The mechanism is not that the Host suddenly becomes a Chinese-web savant. It is that environment-specific work can be routed to a Browser Manager, while the Host continues to reason from compressed evidence. The architecture lets web interaction be specialized without forcing global planning to ingest every messy DOM detail.
The ablation matters because it removes the easy excuse
The most useful experiment in the paper is not the most glamorous table. It is the single-agent baseline with identical tool access.
The authors compare InfoSeeker against single-agent configurations using the same tool access and backbone model families: GPT-5.1, used for Host and Managers, and GPT-5-mini, used for Workers. On WideSearch-en, the GPT-5.1 single-agent setup reaches 6.00% success, 31.85% Row F1, and 35.74% Item F1. InfoSeeker reaches 12.50% success, 50.13% Row F1, and 75.21% Item F1.
This is the paper’s cleanest argument against the “just use a stronger model” reflex. Even the single-agent setup running GPT-5.1 still loses to the hierarchical system. The architecture does not merely wrap a model in more prompts; it changes how evidence is distributed, checked, and compressed.
There is another subtle cost point. The paper reports that more than 80% of token consumption occurs in GPT-5-mini Workers, yet the full system surpasses the stronger GPT-5.1 single-agent baseline. That is architectural arbitrage: reserve expensive reasoning for planning and aggregation, push repetitive evidence execution into cheaper workers, and avoid contaminating the global context with every intermediate artifact.
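The arbitrage is easy to put into numbers. The sketch below uses the 80% worker share mentioned above, but the prices are hypothetical placeholders rather than vendor pricing, and it assumes the monolithic agent would consume the same number of tokens, which a real comparison would need to measure.

```python
def blended_price(worker_share: float, worker_price: float, planner_price: float) -> float:
    """Average price per million tokens when work is split across two models."""
    if not 0.0 <= worker_share <= 1.0:
        raise ValueError("worker_share must be between 0 and 1")
    return worker_share * worker_price + (1.0 - worker_share) * planner_price


# Hypothetical prices per million tokens; NOT real vendor pricing.
PLANNER_PRICE = 10.0
WORKER_PRICE = 1.0

hierarchical = blended_price(0.80, WORKER_PRICE, PLANNER_PRICE)  # 80% on workers
monolithic = blended_price(0.0, WORKER_PRICE, PLANNER_PRICE)     # all on the planner
ratio = hierarchical / monolithic
```

Under these made-up prices the blended rate is about 2.8 per million tokens against 10, or roughly 28% of the monolithic cost. The exact ratio depends entirely on the price gap between the two models.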
For business systems, this matters more than a leaderboard bump. The cost of research automation is not only the token cost of one answer. It is the cost of repeating that answer across thousands of recurring tasks: monitoring vendors, tracking market moves, assembling due-diligence tables, checking compliance changes, or maintaining internal knowledge bases. A system that can route expensive cognition sparingly has a different unit-economics profile from a monolithic agent that brings the premium model to every small errand like a consultant who invoices by the paragraph.
The paper reports average costs of roughly $2.00 per WideSearch-en task and $1.00 per BrowseComp-zh task. Those figures should not be treated as universal pricing guidance. They depend on model pricing, tool stack, concurrency, prompt design, benchmark mix, and API conditions. But they indicate the intended design direction: scale execution width without making every token a luxury item.
The case studies show how decomposition actually works
The Michelin restaurant example in the appendix is a useful miniature of the whole system. The user asks for a complete Markdown table of Michelin three-star restaurants in Paris as of a date, including cuisine style and exact address. A single-agent approach might search, read, store, and format everything in one growing context. InfoSeeker instead proceeds in phases.
First, the Host asks for a reliable list of qualifying restaurants. The Manager parallelizes retrieval across official Michelin sources and Wikipedia-like lists, using overlapping sources for coverage and cross-checking. Once the entity list is established, the next step decomposes attribute collection: each restaurant’s cuisine style and exact address can be searched independently. Workers gather the details, Managers aggregate, and the Host produces the table.
That is the correct decomposition: first discover the row set, then fill columns. It sounds obvious because humans who build datasets already think this way. The point is that many agents do not. They blur entity discovery, attribute verification, and final formatting into one long conversational stream. Then they are shocked, absolutely shocked, when rows go missing.
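The two phases can be sketched as follows. Everything here is a hypothetical stand-in: the in-memory `SOURCES` and `ATTRIBUTES` replace real retrieval, the entity names are placeholders, and the cross-checking rule (flag anything seen in fewer than two sources) is an assumption, not the paper's policy.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for two overlapping sources and an attribute lookup.
SOURCES = {
    "official": ["Restaurant A", "Restaurant B", "restaurant  c "],
    "secondary": ["Restaurant B", "Restaurant C", "Restaurant D"],
}
ATTRIBUTES = {
    "restaurant a": {"cuisine": "x", "address": "1 Example St"},
    "restaurant b": {"cuisine": "y", "address": "2 Example St"},
    "restaurant c": {"cuisine": "z", "address": "3 Example St"},
}


def normalize(name: str) -> str:
    """Collapse case and whitespace so the same entity is not counted twice."""
    return " ".join(name.lower().split())


def discover_rows(sources: dict[str, list[str]]) -> dict[str, set[str]]:
    """Phase 1: union for coverage, remembering which sources agree."""
    seen: dict[str, set[str]] = {}
    for source, names in sources.items():
        for name in names:
            seen.setdefault(normalize(name), set()).add(source)
    return seen


def fill_row(key: str):
    """Phase 2: one independent lookup per entity; None marks a gap explicitly."""
    return key, ATTRIBUTES.get(key)


rows = discover_rows(SOURCES)
with ThreadPoolExecutor(max_workers=4) as pool:
    table = dict(pool.map(fill_row, sorted(rows)))

needs_review = sorted(k for k, srcs in rows.items() if len(srcs) < 2)
missing = sorted(k for k, cells in table.items() if cells is None)
```

The row set is fixed before any column is filled, so a failed attribute lookup shows up as a named gap (`missing`) rather than as a silently dropped row.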
The BrowseComp-zh historical riddle case shows a different pattern. The Host identifies that the task needs broad historical search and then precise webpage reading. The Search Manager locks onto Guo Ziyi and the drama Zui Da Jin Zhi. When source access becomes difficult, the system escalates to the Browser Manager, which navigates the relevant page and confirms the answer: Emperor Daizong. The useful detail here is not just correctness. It is manager routing. Search is good for broad candidate discovery; browsing is better for blocked or interactive pages. One tool path does not need to carry the whole burden.
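The routing pattern in this case, search first and browser on failure, is simple to sketch. The function names, the `SourceBlocked` exception, and the task fields below are hypothetical; in the paper the escalation decision is made by LLM-driven components, not by a hard-coded exception handler.

```python
class SourceBlocked(Exception):
    """Raised when search alone cannot read the page it found."""


def search_manager(task: dict) -> str:
    # Stub: broad candidate discovery that fails on pages needing interaction.
    if task["needs_page_interaction"]:
        raise SourceBlocked(task["url"])
    return f"search summary for {task['question']}"


def browser_manager(task: dict) -> str:
    # Stub: slower, page-level navigation for blocked or interactive sources.
    return f"browser summary for {task['url']}"


def route(task: dict) -> tuple[str, str]:
    """Try the cheap path first and escalate only when it fails."""
    try:
        return "search", search_manager(task)
    except SourceBlocked:
        return "browser", browser_manager(task)


easy = route({"question": "q1", "url": "https://example.com/a", "needs_page_interaction": False})
hard = route({"question": "q2", "url": "https://example.com/b", "needs_page_interaction": True})
```

Returning the name of the path taken alongside the summary keeps the routing decision visible to whoever inspects the run later.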
The failure cases are even more important. In one BrowseComp-zh disease question, InfoSeeker predicts variant transthyretin amyloidosis while the gold answer is xeroderma pigmentosum. The paper diagnoses this as an answer-type mismatch: the benchmark expected a canonical disease entity, while the system treated “variant” as a broader genetic/phenotypic clue and selected a plausible disease class.
That failure is not solved by more workers. It is a constraint interpretation problem. The system needs stronger entity-linking discipline and answer-type control.
The WideSearch AMD CPU failure is more brutal. The user asks for a complete table of AMD Zen CPU products across years and categories. The system identifies relevant sources but hits a token overflow while processing high-volume tables, then falls back to a representative sample. This failure punctures any overly neat reading of the paper. InfoSeeker reduces context pressure; it does not abolish context limits. For very large exhaustive tables, the architecture still needs external storage, database-style extraction, chunked processing, and probably code-first pipelines rather than pure conversational aggregation.
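One way to keep a high-volume table out of the model's context is to stream extracted rows into a database in chunks and hand only aggregates upward. The sketch below uses Python's built-in `sqlite3`. The schema, chunk size, and generated placeholder rows are hypothetical, and this is a suggested direction rather than anything the paper implements.

```python
import sqlite3
from typing import Iterable, Iterator


def chunks(rows: Iterable[tuple], size: int) -> Iterator[list[tuple]]:
    """Yield rows in fixed-size batches so no step holds the full table."""
    batch: list[tuple] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def load(conn: sqlite3.Connection, rows: Iterable[tuple], chunk_size: int = 100) -> int:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products "
        "(name TEXT PRIMARY KEY, year INTEGER, category TEXT)"
    )
    for batch in chunks(rows, chunk_size):
        # INSERT OR IGNORE skips rows whose name is already stored.
        conn.executemany("INSERT OR IGNORE INTO products VALUES (?, ?, ?)", batch)
        conn.commit()
    return conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]


# Placeholder rows standing in for extracted table data; 10 names repeat.
extracted = [(f"product-{i % 250}", 2017 + i % 8, "desktop") for i in range(260)]

conn = sqlite3.connect(":memory:")
total = load(conn, extracted, chunk_size=50)
summary = conn.execute(
    "SELECT year, COUNT(*) FROM products GROUP BY year ORDER BY year"
).fetchall()
```

The planner only ever needs `total` and `summary`, a handful of numbers, while the full table stays queryable on disk or in memory for the final export.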
The business lesson is an evidence factory, not a smarter chatbot
For Cognaptus readers, the practical interpretation is not “use InfoSeeker exactly as implemented.” The paper uses GPT-5.1 for Host and Managers, GPT-5-mini for Workers, MCP tool integration, Firecrawl-style search, Playwright browsing, and sandboxed file/code tools. That stack is informative, but not the transferable essence.
The transferable design pattern is this:
| Business workflow | Old agent design instinct | InfoSeeker-style replacement |
|---|---|---|
| Market mapping | Ask one research agent to browse until it has “enough.” | Separate entity discovery, attribute collection, source validation, and final synthesis. |
| Vendor due diligence | Feed all webpages and PDFs into a large context. | Use domain Managers for web, files, code, and compliance checks; pass only verified summaries upward. |
| Regulatory monitoring | Let one agent scan updates sequentially. | Run independent jurisdiction or agency checks in parallel, then aggregate exceptions. |
| Competitive intelligence | Generate a narrative report directly from search results. | Build and verify structured evidence first; narrative comes last, not first. |
| Internal knowledge-base maintenance | Dump documents into RAG and hope retrieval behaves. | Use Workers for extraction, Managers for schema validation, Host for update planning. |
The economic shift is from “AI as a smart respondent” to “AI as a controlled evidence pipeline.” That is less glamorous and much more deployable.
A business research agent should know when it is doing four different jobs: finding the row set, filling attributes, verifying conflicts, and writing the final answer. These jobs have different failure modes. Entity discovery needs recall. Attribute filling needs precision. Verification needs authority ranking. Final writing needs schema obedience and clarity. A monolithic agent may perform all four, but it has no clean internal contract between them.
InfoSeeker’s hierarchy creates those contracts. The Host does not need every source. It needs a trustworthy state of the research. Managers do not need to write the final answer. They need to return validated step-level results. Workers do not need to understand the whole business question. They need to execute narrow subtasks reliably.
That separation is where ROI can emerge. Faster runtime helps. Cheaper worker tokens help. But the bigger operational value is diagnosability. When a result is wrong, the organization can inspect whether the failure came from search coverage, browser access, entity linking, schema aggregation, or final synthesis. A single-agent transcript often turns failure analysis into archaeology.
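Diagnosability becomes concrete once each stage leaves a small, structured report behind. The sketch below is illustrative only: the stage names follow the four jobs listed earlier, while `StageReport`, the check functions, and the `expected_rows` field are hypothetical (in real wide search the expected row count is rarely known in advance).

```python
from dataclasses import dataclass
from typing import Callable, Optional

Check = Callable[[dict], Optional[str]]  # returns a problem description, or None


@dataclass(frozen=True)
class StageReport:
    stage: str
    ok: bool
    detail: str


def run_checks(state: dict, checks: list[tuple[str, Check]]) -> list[StageReport]:
    reports = []
    for stage, check in checks:
        problem = check(state)
        reports.append(StageReport(stage, problem is None, problem or "passed"))
    return reports


def first_failure(reports: list[StageReport]) -> Optional[StageReport]:
    return next((r for r in reports if not r.ok), None)


def check_discovery(state: dict) -> Optional[str]:
    missing = state["expected_rows"] - len(state["rows"])
    return f"{missing} rows missing" if missing > 0 else None


def check_attributes(state: dict) -> Optional[str]:
    gaps = sorted(k for k, cells in state["rows"].items() if None in cells.values())
    return f"empty cells in {gaps}" if gaps else None


def check_schema(state: dict) -> Optional[str]:
    bad = sorted(k for k, c in state["rows"].items() if set(c) != set(state["columns"]))
    return f"wrong columns in {bad}" if bad else None


state = {
    "expected_rows": 2,
    "columns": ["cuisine", "address"],
    "rows": {
        "entity a": {"cuisine": "x", "address": "1 Example St"},
        "entity b": {"cuisine": "y", "address": None},
    },
}
reports = run_checks(state, [
    ("entity discovery", check_discovery),
    ("attribute filling", check_attributes),
    ("schema aggregation", check_schema),
])
failure = first_failure(reports)
```

Here the first failing stage is attribute filling, with the offending row named, which is a rather shorter read than a full agent transcript.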
The paper’s limits are deployment instructions in disguise
The limitations section is short, but the appendix makes the boundaries clearer.
First, InfoSeeker depends on strong backbone models and API-accessible tools. Host and Manager quality matters. If the planner decomposes badly, parallel workers merely execute the wrong plan faster. Industrial users should treat decomposition quality as a measurable capability, not as prompt magic.
Second, concurrency is not free. The paper’s latency gains come from parallelism, but real deployments face rate limits, tool costs, crawling failures, CAPTCHAs, and coordination overhead. A 17-worker setup can be efficient for wide tasks; it can also become expensive noise if the task is narrow or if subtasks are highly redundant.
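Taking the latency figures quoted earlier at face value (911 seconds with one worker, 162 seconds with 17), the implied speed-up and per-worker efficiency work out as follows. The symbols S and E are introduced here for illustration and are not the paper's notation.

$$S=\frac{T_{1}}{T_{17}}=\frac{911}{162}\approx 5.6,\qquad E=\frac{S}{17}\approx 0.33$$

Seventeen workers buy roughly a 5.6-fold reduction in wall-clock time, so each worker delivers about a third of ideal linear scaling. That is still a good trade for wide tasks, but it is a reminder that added workers are not free throughput.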
Third, hand-tuned prompts are doing real work. The appendix prompts include detailed rules for Manager behavior: deduplicate entities, avoid redundant queries, escalate to browser when needed, preserve requested formats, retry failures, and return source URLs with relevance notes. This is not incidental decoration. It is operational policy. A company copying only the agent topology but ignoring these behavioral contracts will reproduce the org chart without the management.
Fourth, the system still struggles with answer-type constraints and truly massive exhaustive tables. The disease failure points to semantic disambiguation. The AMD CPU failure points to context-volume and data-engineering limits. For enterprise use, the lesson is not to abandon structured databases and code pipelines. The lesson is to let agents orchestrate them.
Finally, the benchmarks are still benchmarks. WideSearch and BrowseComp-zh are useful because they stress real pain points, but they are not the same as a regulated production workflow. In a compliance or investment setting, correctness requirements may be higher, source policies stricter, and audit trails mandatory. InfoSeeker gives a design direction, not a deployment license.
The next search agent looks more like an organization
InfoSeeker’s contribution is not that it invents parallelism. Parallel systems are old. MapReduce is old. Delegation is old. The interesting part is that the paper brings these ideas into agentic search at the point where LLM products are starting to hit the limits of “one big model plus tools.”
The paper’s mechanism-first lesson is clear: wide intelligence requires narrow local context. Let the Host plan. Let Managers decompose and verify. Let Workers execute. Pass summaries upward. Keep raw evidence local unless it is needed. Do not ask the final answer generator to also be the crawler, database, auditor, and intern who formats the table.
The current generation of deep-research systems often sells the image of a tireless analyst. InfoSeeker suggests a more useful metaphor: a small research firm with a good operating model. The firm is not smarter because every employee sees every document. It is smarter because the right people see the right documents, at the right level of abstraction, and the final partner does not need to read 300,000 tokens of scraped CPU tables before making a decision.
That is less romantic than artificial general intelligence. It is also closer to how work gets done.
For businesses, the conclusion is not “buy more context.” It is “design the context economy.” Decide what must be global, what should remain local, what can be parallelized, what needs validation, and what must be persisted outside the model altogether. The future of AI search may still involve stronger models. Of course it will. But stronger models inside poorly structured workflows will continue to fail in expensively familiar ways.
InfoSeeker’s quiet provocation is that the next frontier of AI search may not be deeper thought.
It may be better management.
Cognaptus: Automate the Present, Incubate the Future.
[1] Ka Yiu Lee, Yuxuan Huang, Zhiyuan He, Huichi Zhou, Weilin Luo, Kun Shao, Meng Fang, and Jun Wang, “InfoSeeker: A Scalable Hierarchical Parallel Agent Framework for Web Information Seeking,” arXiv:2604.02971v1, 2026, https://arxiv.org/abs/2604.02971.