Contracts are not polite. They hide the important clause on page 83, define the crucial exception on page 17, and bury the fatal cross-reference in an appendix nobody wanted to read. Annual reports behave similarly. So do medical SOPs, litigation files, policy manuals, technical logs, and most documents produced by institutions that have discovered both Microsoft Word and committees.

This is precisely where long-context models are supposed to shine. Give the model the whole document, ask the question, collect the answer, invoice the client. Lovely, except for the small inconvenience that long-context models often fail in a very human way: they read, but they do not always notice. Relevant information placed in the middle of a long input can be underused, a failure pattern usually called “lost in the middle.”

The paper behind Tree of Agents, or TOA, attacks that problem with a useful shift in framing.[1] It does not merely ask, “How do we make one model swallow more tokens?” It asks, “What if the order of reading changes the answer?” That sounds obvious when humans do it. It becomes oddly radical when applied to LLM systems, where the industry’s default instinct is still to buy a larger context window and hope the middle becomes less invisible. Hope, as usual, is not an architecture.

TOA’s answer is to split a long document into chunks, assign each chunk to an agent, let agents form local views, then make them explore other chunks in different orders through tree-structured paths before the system votes on a final answer. The contribution is not “more agents,” and it is not a prettier RAG wrapper. The core idea is that long-context reasoning is partly an ordering problem. Different reading sequences can produce different interpretations. TOA tries to exploit that instead of pretending the document has one neutral, frictionless path from start to finish.

The long-context problem is not only memory; it is attention allocation

The usual enterprise story around long context is capacity-driven. A model with 128K, 200K, or 2M tokens sounds like a model that can “read everything.” The paper is more sceptical, and rightly so. A larger context window gives the model access to more text, but access is not the same as reliable use. More context can also mean more irrelevant material, weaker focus, and more opportunities for evidence to be diluted by its neighbours.

TOA sits among several families of long-context solutions:

| Approach | What it tries to fix | Typical risk |
| --- | --- | --- |
| Long-context model training or model modification | Expands the amount of text a model can process | Expensive, complex, and still not immune to position bias |
| Retrieval or prompt compression | Sends the model a smaller, denser input | May discard evidence needed for multi-hop or global reasoning |
| Sequential multi-agent reading | Splits the document and combines chunk-level findings | Can suffer from information decay along a fixed chain |
| TOA-style tree reasoning | Tests multiple chunk-reading orders before consensus | Adds orchestration cost and depends on disciplined pruning |

The important distinction is that TOA does not try to make the document shorter first. Nor does it rely on one leader agent to decompose the task and direct everyone else. It treats each chunk reader as a partial observer, then makes those observers exchange evidence and probe the document from multiple angles.

That matters because some long-document questions are not local. A local question asks, “What is the termination date in section 12?” Retrieval can often handle that. A global question asks, “Which party has practical leverage if the supplier misses delivery during a force majeure event?” That answer may require definitions, exceptions, notice periods, remedies, and commercial context distributed across the file. The failure mode is not simply that the model cannot find a sentence. It may find several sentences and still assemble the wrong story.

TOA turns reading into a three-phase process

The paper’s mechanism is clean enough to be useful outside the paper. TOA has three phases: chunk perception, multi-perspective understanding, and consensus formation.

First, the document is split into chunks, with one agent assigned to each chunk. Each agent receives the user query and its own document segment. It extracts relevant evidence and proposes a provisional answer. This is intentionally local. The agent is not pretending to know the whole document. It produces a chunk-level cognition: “Here is what my part says, and here is what I think the answer might be.”
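The first phase can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `Cognition`, `split_into_chunks`, and `perceive` are hypothetical, the split is by word count rather than tokens, and a keyword match stands in for the LLM call that would extract evidence in a real system.

```python
from dataclasses import dataclass


@dataclass
class Cognition:
    """Chunk-level view: what one agent found and what it provisionally answers."""
    chunk_id: int
    evidence: str
    answer: str


def split_into_chunks(text: str, n_chunks: int) -> list[str]:
    """Split text into n_chunks contiguous pieces of near-equal word count."""
    words = text.split()
    size, extra = divmod(len(words), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(" ".join(words[start:end]))
        start = end
    return chunks


def perceive(chunk_id: int, chunk: str, query: str) -> Cognition:
    """Hypothetical stand-in for an LLM call: keyword match instead of reasoning."""
    chunk_words = set(chunk.lower().split())
    hits = [w for w in query.lower().split() if w in chunk_words]
    if hits:
        return Cognition(chunk_id, chunk, f"possibly relevant: {', '.join(hits)}")
    return Cognition(chunk_id, "", "None")  # the agent admits its chunk says nothing


document = "alpha beta gamma delta epsilon zeta eta theta iota kappa"
chunks = split_into_chunks(document, 3)
cognitions = [perceive(i, c, "where is theta") for i, c in enumerate(chunks)]
for c in cognitions:
    print(c.chunk_id, c.answer)
```

Only the agent holding the third chunk reports anything. The other two return "None", which is the local honesty the phase is designed to produce.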

Second, the agents read one another’s initial cognitions. This is where TOA starts to differ from a simple voting scheme. Each agent decides which other chunks may help refine its view. It then explores those selected chunks through different reading orders. If an agent wants to inspect chunks 2, 3, and 4, TOA can explore paths such as 2→3→4, 3→4→2, or 4→3→2. These paths form the “tree.”
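To see why these paths form a tree, it helps to enumerate them. The sketch below assumes, purely for illustration, that every permutation of the selected chunks is a candidate path; the article only says TOA explores paths "such as" these, so the exhaustive enumeration and the helper names `reading_orders` and `build_tree` are assumptions.

```python
from itertools import permutations


def reading_orders(selected_chunks):
    """All candidate orders in which the selected chunks could be read."""
    return list(permutations(selected_chunks))


def build_tree(orders):
    """Nest the orders into a prefix tree: each key is the next chunk to read."""
    tree = {}
    for order in orders:
        node = tree
        for chunk_id in order:
            node = node.setdefault(chunk_id, {})
    return tree


orders = reading_orders([2, 3, 4])
tree = build_tree(orders)

print(len(orders))       # number of candidate paths for three chunks
print(sorted(tree))      # first-level branches
print(tree[2])           # what can follow after reading chunk 2 first
```

Three selected chunks already give six paths, which is why the cost controls discussed later are not optional.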

Third, each agent forms a final local answer based on its best accumulated path, and the system aggregates those answers through majority voting. If agents disagree, the vote provides a simple consensus mechanism. If there is a tie, the paper’s implementation uses an additional independent decision step based on the agents’ factual conclusions.
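The consensus step is simple enough to sketch. The version below is a hedged illustration: `majority_vote` and the `tie_breaker` callback are hypothetical names, and the callback merely marks the place where the paper's additional independent decision step would sit; its internals are not reproduced here.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str],
                  tie_breaker: Callable[[list[str]], str]) -> str:
    """Return the most common answer; hand tied leaders to a separate decision step."""
    if not answers:
        raise ValueError("at least one agent answer is required")
    counts = Counter(answers)
    top = max(counts.values())
    leaders = sorted(a for a, c in counts.items() if c == top)
    if len(leaders) == 1:
        return leaders[0]
    return tie_breaker(leaders)


def placeholder_tie_breaker(leaders: list[str]) -> str:
    """Assumption: a real system would run another model call over the agents' facts."""
    return leaders[0]


clear = majority_vote(["A", "B", "A", "C", "A"], placeholder_tie_breaker)
tied = majority_vote(["A", "B", "A", "B", "C"], placeholder_tie_breaker)
print(clear, tied)
```

Keeping the tie-break as an explicit function makes it something that can be logged and tested rather than an accident of list ordering.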

A simplified view looks like this:

Long document
  ├─ Chunk 1 → Agent 1 → local evidence + provisional answer
  ├─ Chunk 2 → Agent 2 → local evidence + provisional answer
  └─ Chunk 3 → Agent 3 → local evidence + provisional answer
        ↓
Agents inspect one another’s cognitions
        ↓
Selected chunks are explored in multiple reading orders
        ↓
Caching and pruning reduce repeated or useless paths
        ↓
Each agent finalises an answer
        ↓
Majority vote produces the final response

The tree is not decorative. It is doing the work. A fixed reading order can encourage the model to interpret later evidence through the assumptions formed earlier. The paper’s appendix gives a simple illustrative example: the same short facts can imply different interpretations depending on whether the reader encounters “the room was messy” before or after “he had just finished a big project.” That appendix is not main evidence; it is an explanatory device. Its purpose is to make the ordering hypothesis intuitive before the experiments test whether TOA helps on long-context tasks.

The expensive part is controlled with caching and pruning, not wished away

Multi-perspective reading has an obvious problem: paths multiply. If every agent explores every possible order of every useful chunk, the method becomes the AI equivalent of asking five interns to read the same binder in every possible sequence. Admirably thorough. Financially suspicious.

TOA introduces two efficiency controls.

The first is prefix-hash caching. Many tree paths share prefixes. If one path has already computed the agent’s cognitive state after reading chunks 0→1→2, another path that begins 0→1→2 does not need to recompute that same state. It can retrieve the cached cognition and continue from the branch point. This turns repeated reasoning into lookup where possible.
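A toy version of the caching idea is sketched below. It is an assumption-laden illustration, not the paper's code: `CountingReader` is a hypothetical stand-in for an LLM call, the cache is a plain dictionary keyed by the tuple of chunks read so far (Python hashes the tuple), and the sketch assumes the cognitive state depends only on that ordered prefix for a given agent and query.

```python
from itertools import permutations


class CountingReader:
    """Hypothetical stand-in for an LLM call; counts how often it is invoked."""

    def __init__(self):
        self.calls = 0

    def read(self, state: str, chunk_id: int) -> str:
        self.calls += 1
        return f"{state}>{chunk_id}"


def explore(path, reader, cache=None):
    """Walk one reading order, reusing cached states for shared prefixes."""
    state = "start"
    for depth, chunk_id in enumerate(path, start=1):
        prefix = tuple(path[:depth])  # hashable key identifying the path so far
        if cache is not None and prefix in cache:
            state = cache[prefix]     # lookup instead of a fresh model call
            continue
        state = reader.read(state, chunk_id)
        if cache is not None:
            cache[prefix] = state
    return state


paths = list(permutations([0, 1, 2]))

uncached = CountingReader()
for p in paths:
    explore(p, uncached)

cached = CountingReader()
cache = {}
for p in paths:
    explore(p, cached, cache)

print(uncached.calls, cached.calls)  # 18 15
```

The saving in this toy is modest because full permutations of three chunks share few prefixes. The point is the mechanism: every shared prefix is paid for once.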

The second is adaptive pruning. When an agent reads a chunk along a path and judges it useless for the current question, the remaining continuation of that path can be terminated. This matters because useful information is sparse in many long documents. Most chunks are not equally relevant. A system that treats them as equally promising is not robust; it is merely expensive in a well-lit way.
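Pruning can be sketched the same way. In the illustration below, `is_useful` is a hypothetical predicate standing in for the agent's model-based judgment, and the set of useful chunks is invented for the example. Note that the useless chunk is still read once, since the agent must read it to judge it; only the continuation is cut.

```python
from itertools import permutations


def explore_with_pruning(path, is_useful):
    """Read chunks in order; stop the path after the first chunk judged useless."""
    read = []
    for chunk_id in path:
        read.append(chunk_id)          # reading the chunk costs one call
        if not is_useful(chunk_id):
            break                      # terminate the rest of this path
    return read


useful_chunks = {2, 4}                 # assumption for the example: chunk 3 is noise
paths = list(permutations([2, 3, 4]))

reads_without_pruning = sum(len(p) for p in paths)
reads_with_pruning = sum(
    len(explore_with_pruning(p, lambda c: c in useful_chunks)) for p in paths
)
print(reads_without_pruning, reads_with_pruning)  # 18 12
```

The quality of the saving depends entirely on the quality of `is_useful`, which is why a weaker base model prunes less effectively.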

The appendix efficiency results clarify the role of these controls. On NovelQA using DeepSeek-V3, TOA’s phase-two reasoning would require an estimated 2,103 API calls without caching and pruning. Caching alone reduces that to 1,830 calls, a 13.0% reduction. Combining caching and pruning reduces phase-two calls to 1,034, a 50.8% reduction. Across all phases, TOA uses 2,534 calls, compared with 2,287 for COA.
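The percentages above follow directly from the reported call counts, and the same arithmetic gives the size of the remaining gap to COA. The snippet below only recomputes figures quoted in this article; the helper name `pct_change` is ours.

```python
def pct_change(before: int, after: int) -> float:
    """Percentage change from before to after, rounded to one decimal place."""
    return round(100 * (after - before) / before, 1)


phase2_baseline = 2103       # estimated phase-two calls, no caching or pruning
phase2_cached = 1830         # caching only
phase2_cached_pruned = 1034  # caching plus pruning
toa_total, coa_total = 2534, 2287

print(pct_change(phase2_baseline, phase2_cached))         # -13.0
print(pct_change(phase2_baseline, phase2_cached_pruned))  # -50.8
print(pct_change(coa_total, toa_total))                   # 10.8
```

So even after optimisation, TOA makes roughly 10.8% more calls than COA in this reported configuration.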

That last comparison is the sober part. TOA reduces its own combinatorial overhead, but it is still not cheaper than a simpler chain method in the reported call count. Token statistics tell the same story with more texture: on DeepSeek-V3, caching and pruning cut TOA’s token usage by 59% on NovelQA and 33% on DetectiveQA. On LLaMA3.1-8B, the savings are only about 5% on both datasets. The optimisation helps, but its effect depends on the base model’s ability to judge usefulness and on the dataset.

So the business reading is not “TOA makes multi-agent reasoning cheap.” It is more precise: TOA makes a costly multi-perspective method less wasteful, while preserving enough accuracy gains that the extra spend may be justified for hard long-document questions. That is less glamorous. It is also closer to procurement.

The main evidence: better answers on long-context QA, not magic document intelligence

The experiments use two long-context reasoning datasets and one retrieval-style benchmark. DetectiveQA involves detective novel questions with long inputs; NovelQA tests question answering over novels exceeding 200K tokens, with the paper focusing on multi-hop reasoning questions. The authors also use Needle-in-a-Haystack tests, where one or more specific facts are inserted into a long unrelated document and the model must retrieve them.

The paper evaluates TOA against COA, LONGAGENT, LongLLMLingua, LongRAG, Sequential, Vote, and commercial long-context models GPT-4o and Gemini 1.5 Pro in relevant settings. LLaMA3.1-8B and DeepSeek-V3 serve as base models for TOA and most baselines. The authors sample 100 examples from each QA dataset due to computational constraints and report results over three independent runs.

The results support three different claims, and those claims should not be mashed together.

| Experiment or table | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| DetectiveQA and NovelQA comparison | Main evidence | TOA improves accuracy over several long-context baselines on sampled multi-hop QA tasks | That TOA will outperform all long-context models on all enterprise document workflows |
| Needle-in-a-Haystack figures | Main evidence / position-bias probe | TOA is more stable when relevant facts appear at different depths, including middle positions | That synthetic retrieval fully captures legal, financial, or operational reasoning |
| COA chunk-size comparison | Sensitivity test against a sequential baseline | Larger chunks help COA up to a point, but TOA remains stronger in the reported setup | That TOA’s chosen chunking is universally optimal |
| Input-length robustness figure | Robustness test | TOA appears more stable as input length grows, while COA degrades beyond longer ranges | Exact production latency or cost under enterprise workloads |
| Agent-number table | Ablation / sensitivity test | Five agents perform best among 3, 5, and 7 in the reported QA setting | That five agents is a universal rule |
| Efficiency appendix | Implementation and cost analysis | Caching and pruning reduce TOA’s overhead, especially with DeepSeek-V3 | That TOA is cheaper than simpler baselines |

With LLaMA3.1-8B, TOA reaches 54.3% accuracy on DetectiveQA and 45.0% on NovelQA. On DetectiveQA, this beats LONGAGENT at 48.7%, LongRAG at 37.0%, LongLLMLingua at 30.7%, and COA at 25.3%. On NovelQA, TOA’s 45.0% is slightly ahead of LongRAG’s 44.0% and higher than LONGAGENT’s 37.3%, Vote’s 34.3%, COA’s 26.3%, and LongLLMLingua’s 17.0%.

The commercial comparison is interesting but should be read carefully. GPT-4o records 56.0% on DetectiveQA and 48.7% on NovelQA. Gemini 1.5 Pro records 55.7% and 45.7%. So LLaMA3.1-8B with TOA comes close to the commercial models on these sampled tasks but trails both of them on both datasets. With DeepSeek-V3 as the base model, TOA reaches 57.3% on DetectiveQA and 47.3% on NovelQA, which exceeds both commercial models on DetectiveQA and sits below GPT-4o but above Gemini 1.5 Pro on NovelQA.

None-rate adds another useful layer. In the paper, “None” means the model indicates it cannot retrieve relevant information to answer. With LLaMA3.1-8B, TOA’s none-rate is 1.7% on DetectiveQA and 4.3% on NovelQA. That is low, and on DetectiveQA it comes with the best accuracy among the LLaMA-based systems. On NovelQA, Vote has a lower none-rate of 0.3%, but weaker accuracy at 34.3%. This distinction matters. A system that always guesses may have a low abstention rate and still be operationally dangerous. A useful system must balance answer rate with correctness.
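The trade-off is easy to make concrete. The sketch below scores two invented systems on four invented questions; the data is illustrative only and is not from the paper. `accuracy_when_answered` is an extra diagnostic added here, not a metric the paper reports.

```python
def score(predictions, gold):
    """Accuracy and none-rate over all questions, plus accuracy on answered ones."""
    n = len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    none = sum(p == "None" for p in predictions)
    answered = n - none
    return {
        "accuracy": correct / n,
        "none_rate": none / n,
        "accuracy_when_answered": correct / answered if answered else 0.0,
    }


gold = ["A", "B", "C", "D"]
always_guesses = ["A", "A", "A", "A"]      # never abstains, mostly wrong
abstains = ["A", "B", "None", "None"]      # abstains when unsure, right when it answers

guesser = score(always_guesses, gold)
cautious = score(abstains, gold)
print(guesser)
print(cautious)
```

The guesser has the lower none-rate and the lower accuracy, which is the Vote pattern described above in miniature.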

The Needle-in-a-Haystack results are more directly aligned with the “lost in the middle” claim. TOA achieves average scores of 9.38 in the single-needle setting and 7.77 in the multi-needle setting. The paper reports stable performance around middle depth ranges where Sequential and Vote drop. This is the cleanest evidence for the mechanism: multi-path reading is not merely improving aggregate QA scores; it appears to reduce sensitivity to where the evidence is placed.
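For readers unfamiliar with how such a probe is built, the sketch below shows the general idea of placing a fact at a chosen relative depth in filler text. It is a generic illustration under our own assumptions (word-level positions, a hypothetical `insert_needle` helper, an invented needle sentence), not the benchmark harness used in the paper.

```python
def insert_needle(haystack_words, needle, depth):
    """Insert needle at a relative depth: 0.0 is the start, 0.5 the middle, 1.0 the end."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    position = int(depth * len(haystack_words))
    return haystack_words[:position] + [needle] + haystack_words[position:]


filler = ["filler"] * 10
needle = "THE-CODE-IS-1234"

positions = {}
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    document = insert_needle(filler, needle, depth)
    positions[depth] = document.index(needle)

print(positions)
```

Sweeping the depth while holding the question fixed is what isolates position sensitivity from general difficulty.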

The agent-count result is a warning against “just add more agents”

The agent-number table deserves special attention because it punctures a common but lazy interpretation. If TOA works because agents are good, then more agents should be better. The paper says no.

Using three agents, TOA reaches 46.0% on DetectiveQA and 25.0% on NovelQA, with none-rates of 17.0% and 32.0%. Using five agents, performance rises to 54.3% and 45.0%, with none-rates around 1.7% and 4.0%. Using seven agents, performance falls to 51.0% and 28.0%, with none-rates of 4.0% and 19.0%.

The mechanism is plausible. Too few agents means each agent receives a larger chunk, so middle-position failures can reappear inside the chunk. Too many agents means excessive fragmentation, making synthesis harder. Five agents is the sweet spot in this experimental configuration, not a law of nature. The lesson for deployment is to tune chunking and agent count together. Treating “agent” as a magic noun will get expensive quickly.

The COA chunk-size experiment reinforces the same point from another angle. COA improves as chunk size increases from 4K to 16K, then drops at 32K. Larger chunks reduce fragmentation, but they also increase the burden on each agent and preserve the risk of attention degradation. TOA’s advantage is not simply chunk size. It is the combination of chunking, cross-agent inspection, path exploration, caching, pruning, and voting.

Why this is not just RAG in a nicer jacket

It is tempting to squeeze TOA into the RAG bucket because both approaches deal with too much text. That would miss the paper’s sharper contribution.

RAG usually asks: “Which chunks should we retrieve for this question?” TOA asks a different question: “How should partial readers update one another when different chunks suggest different answers?” Retrieval is a selection problem. TOA is an interpretation-order problem.

That difference matters in enterprise settings where the question cannot be answered by one isolated chunk. A financial covenant may depend on definitions, exclusions, and schedules. A policy exception may require the reader to reconcile the main rule, a footnote, and an approval workflow. A customer dispute may turn on the sequence of interactions rather than any single message. RAG can still be part of the system, but TOA points toward a higher-level orchestration layer: retrieve or chunk first, then deliberately test competing paths of interpretation.

This also explains why the paper’s prompt templates are more practically relevant than they first appear. Phase-one prompts require agents to extract evidence and answer from their assigned chunk without external knowledge. Phase-two prompts ask agents to select helpful peer responses and evaluate additional chunks as useful or useless. Phase-three prompts force a structured final answer, including “None” when uncertain. These templates are implementation details, but in production they become governance hooks. They create inspectable artefacts: evidence, utility judgments, path decisions, final votes, and tie-breaks.

A compliance team cannot audit “the model read the whole thing and vibes were achieved.” It can audit which chunks were read, which ones were rejected, which paths were pruned, and where agents disagreed. TOA is not automatically compliant, of course. But it gives the system places to attach logs, tests, and escalation rules. That is more than most context-window heroics offer.
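What such an audit record might look like is sketched below. The schema is hypothetical: the `PathRecord` fields are our guess at a useful minimum, not something the paper specifies.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PathRecord:
    """One explored path for one agent, in a form a reviewer can inspect later."""
    agent_id: int
    path: list                  # chunk ids in the order they were read
    utility: dict               # chunk id -> True (useful) or False (useless)
    pruned_at: Optional[int]    # chunk id where the path was cut, if it was
    final_answer: str


def to_log_line(record: PathRecord) -> str:
    """Serialise a record as one JSON line; integer keys become strings in JSON."""
    return json.dumps(asdict(record), sort_keys=True)


record = PathRecord(
    agent_id=1,
    path=[2, 3],
    utility={2: True, 3: False},
    pruned_at=3,
    final_answer="B",
)
line = to_log_line(record)
print(line)
```

One JSON line per explored path is enough to reconstruct which chunks were read, which were rejected, and where a path stopped.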

What Cognaptus would infer for business use

The paper directly shows that TOA improves performance on sampled long-context QA benchmarks and needle retrieval tests, using specific base models and experimental settings. It also shows that caching and pruning reduce TOA’s own overhead, especially with DeepSeek-V3, though TOA remains somewhat costlier than COA in the reported call and token comparisons.

The business inference is narrower than “replace your long-context stack.” A more defensible inference is this:

TOA is most attractive as an escalation layer for high-value, high-ambiguity document questions where local retrieval is not enough and wrong answers are costly.

That points to a router-based deployment pattern:

  1. Try cheap retrieval or direct long-context answering first for local questions.
  2. Check whether retrieved evidence is concentrated, consistent, and sufficient.
  3. Escalate to TOA-style multi-perspective reasoning when evidence is distributed, conflicting, or position-sensitive.
  4. Log path decisions, utility judgments, votes, and abstentions.
  5. Send unresolved disagreements to a human reviewer rather than pretending majority vote is divine revelation.
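The routing steps above can be sketched as two small decisions. Everything here is an assumption for illustration: the `Evidence` shape, the `max_chunks` and `min_margin` thresholds, and the route labels are hypothetical and would need tuning against real workloads.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Evidence:
    chunk_ids: list   # chunks the cheap retriever drew evidence from
    answers: list     # answer implied by each retrieved passage ("None" if unclear)


def route(evidence: Evidence, max_chunks: int = 2) -> str:
    """Stay cheap when evidence is concentrated and consistent; otherwise escalate."""
    distinct_answers = set(evidence.answers) - {"None"}
    concentrated = len(set(evidence.chunk_ids)) <= max_chunks
    consistent = len(distinct_answers) == 1
    return "cheap" if concentrated and consistent else "toa"


def after_toa(votes: list, min_margin: int = 2) -> str:
    """Accept the vote only if the winner leads clearly; otherwise ask a human."""
    ranked = Counter(votes).most_common(2)
    top = ranked[0][1]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    return "answer" if top - runner_up >= min_margin else "human_review"


local_question = route(Evidence([12], ["2025-01-01"]))
global_question = route(Evidence([3, 17, 40], ["A", "B", "A"]))
clear_vote = after_toa(["A", "A", "A", "B", "A"])
split_vote = after_toa(["A", "A", "B", "B", "None"])
print(local_question, global_question, clear_vote, split_vote)
```

Making the escalation and review thresholds explicit parameters keeps the cost of the premium path visible in configuration rather than in the invoice.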

In operational terms, TOA is not a universal default. It is a premium reasoning path. Use it for due diligence, contract exception analysis, policy reconciliation, complex support histories, audit memos, and investigation files. Avoid it for simple lookup, single-section extraction, or workflows where latency dominates value.

The return on investment is therefore not “cheaper tokens.” The paper does not establish that. The ROI case is reduced error on hard long-document tasks, better traceability of intermediate evidence, and a more disciplined way to decide when a model should answer versus abstain. That is a less theatrical claim, which makes it more useful.

The boundaries are real, and they affect adoption

The paper’s limitations are not cosmetic. They shape how seriously one should take the deployment story.

First, the main QA experiments use 100 sampled examples from each dataset, partly due to computational constraints. The authors repeat runs and report mean and standard deviation, which helps, but the sample size still matters. A procurement team should not treat these results as a universal benchmark across legal, financial, healthcare, and engineering documents.

Second, the datasets are mostly question-answering tasks, including multiple-choice QA and synthetic needle retrieval. Those are useful probes, especially for position bias, but they are not the same as open-ended memo drafting, adversarial contract review, regulatory classification, or multi-document evidence synthesis. TOA may help there; the paper does not prove it.

Third, TOA depends on the base model’s ability to judge which chunks are useful. The efficiency appendix makes this visible. Caching and pruning save far more tokens with DeepSeek-V3 than with LLaMA3.1-8B in the reported setup. A weaker model may not prune well, may misjudge relevance, or may confidently preserve the wrong path. The framework is plug-and-play, but not intelligence-free.

Fourth, majority voting reduces some individual-agent errors but does not eliminate correlated failure. If all agents share the same blind spot, the vote merely formalises it. This is especially relevant in enterprise documents where ambiguity, missing definitions, or inconsistent drafting can mislead every reader in the same direction.

Fifth, latency remains an issue. Even with caching and pruning, TOA adds orchestration overhead. For high-volume workflows, it should be routed selectively and budgeted explicitly. No one should discover the tree explosion from an invoice.

The useful idea is structured disagreement

The strongest lesson from TOA is not that trees are fashionable or that agents have once again been invited to the meeting. The useful idea is that long-document understanding benefits from structured disagreement. Different chunks produce different provisional answers. Different reading orders produce different interpretations. A serious system should expose those differences, test them, and then aggregate them with traceable rules.

This is a more mature view of long-context AI. It moves away from the fantasy of one giant context window as an all-seeing clerk. It treats document understanding as a process: local perception, selective cross-reading, order-sensitive interpretation, cost-aware pruning, and consensus. That process is messier than “upload PDF, get answer.” It is also closer to how actual institutional knowledge work gets done, minus the calendar invites and the stale biscuits.

For Cognaptus clients, the immediate takeaway is practical. If the task is local, stay simple. If the task is global, position-sensitive, and expensive to get wrong, a TOA-style architecture is worth testing. Not because it is magic. Because it gives long-context reasoning something it badly needs: a procedure.

And procedures, unlike context windows, can be audited.

Cognaptus: Automate the Present, Incubate the Future.


1. Song Yu, Xiaofei Xu, Ke Deng, Li Li, and Lin Tian, “Tree of Agents: Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning,” arXiv:2509.06436, 2025.