When Retrieval Learns to Breathe: Teaching LLMs to Go Wide *and* Deep

Retrieval has a breathing problem.

Most enterprise RAG systems inhale once, grab the nearest chunks, and then hope the model can make the answer sound less fragile than the evidence actually is. That works tolerably well when the user asks for something sitting neatly inside a document paragraph. It works less well when the answer lives across entities, relations, aliases, product categories, authors, diseases, suppliers, regulations, or customer records. In other words, it works less well in the part of business where knowledge is not a pile of text but a network.

The paper behind ARK, short for Adaptive Retriever of Knowledge, starts from this uncomfortable fact: knowledge-graph retrieval is not just “semantic search, but with edges.” It is a control problem.¹ A system must decide when to search globally across the graph, when to follow local relations, when to re-anchor somewhere else, and when to stop before it wanders into graph tourism. Lovely hobby. Bad retrieval strategy.

ARK’s contribution is not that it invents a more glamorous embedding model. It does something more operationally interesting: it gives an LLM a tiny toolset and lets it manage the breadth-depth tradeoff during retrieval. One tool performs global lexical search over node descriptors. The other performs one-hop neighborhood exploration from a retrieved node, with optional node-type and edge-type filters plus query-based ranking. Multi-hop retrieval is not hard-coded as “always expand three hops.” It emerges because the agent can call the neighborhood tool repeatedly, interleave it with global search, and terminate when it has enough evidence.

That sounds modest. It is exactly why the paper matters.

The real problem is not finding nodes; it is controlling movement

A knowledge graph gives a model two temptations.

The first is breadth. Search everywhere. Use text descriptors. Find candidate nodes whose titles, summaries, attributes, or descriptions match the query. This is useful when the answer is text-heavy: “find products similar to this description,” “retrieve papers about this topic,” “locate entities matching this regulatory phrase.” Breadth prevents the system from being trapped in one neighborhood too early.

The second is depth. Follow relations. Move from a drug to its diseases, from a paper to its authors, from an author to related papers, from a product to its reviews, from a client to its subsidiaries and transactions. This is useful when the answer is relational: the relevant node is not merely textually similar to the query but connected through a path.

The common mistake is to treat this as a model-selection problem: choose a better dense retriever, add graph embeddings, or force multi-hop traversal. ARK reframes it as a runtime policy problem. The retriever should not have one fixed behavior. It should inspect the query, observe intermediate results, and adjust.

That is the “breathing” metaphor: inhale broadly, exhale locally, inhale again if the current path is wrong, and stop before suffocation by neighbors.

The paper’s core design can be summarized as follows:

Retrieval action	What it does	Why it matters operationally
Global Search	Searches across all node text descriptors using an agent-generated subquery	Gives the agent broad entry points and lets it recover from bad local anchors
Neighborhood Exploration	Expands from a selected node to adjacent nodes, with optional type filters and query ranking	Lets the agent follow relational evidence without pre-setting hop depth
Parallel Exploration	Runs several independent agents and aggregates their ranked outputs by vote count and earliest occurrence	Improves robustness without simply making one trajectory longer
Trajectory Distillation	Fine-tunes a smaller Qwen3-8B model to imitate teacher tool-use traces	Converts expensive agent behavior into a cheaper deployable policy

This is not a decorative agent wrapper around retrieval. The agent is deciding where retrieval should look next.

ARK’s two tools are deliberately boring

There is a useful discipline in ARK’s tool design. The system does not expose dozens of graph operators. It does not ask the LLM to write arbitrary graph queries. It gives the model two primitive movements.

First, Global Search retrieves the top-$k$ nodes in the entire graph according to a relevance function over node text. The paper implements this relevance function with BM25. That choice may look conservative in an age where every problem is apparently solved by adding a larger embedding model and a dashboard. The experiments later make the conservatism look less quaint.

Second, Neighborhood Exploration starts from a node already retrieved and returns adjacent nodes. The agent can filter by node type and edge type, and it can rank neighbors with a subquery. Each call is one hop, but repeated calls compose into multi-hop traversal. The important part is that the hop count is not fixed in advance. The agent decides whether to keep moving.
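
To make the two movements concrete, here is a minimal sketch over a toy in-memory graph. It is not the paper's implementation: the `Graph` class, the function names, and the toy data are hypothetical, edges are treated as undirected, and a simple term-overlap score stands in for BM25.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    # node_id -> (node_type, text descriptor)
    nodes: dict = field(default_factory=dict)
    # node_id -> list of (edge_type, neighbor_id)
    edges: dict = field(default_factory=dict)

    def add_node(self, node_id: str, node_type: str, text: str) -> None:
        self.nodes[node_id] = (node_type, text)
        self.edges.setdefault(node_id, [])

    def add_edge(self, src: str, edge_type: str, dst: str) -> None:
        # Edges are treated as undirected here, which is an assumption.
        self.edges[src].append((edge_type, dst))
        self.edges[dst].append((edge_type, src))


def overlap_score(query: str, text: str) -> int:
    """Stand-in for BM25: count distinct query terms present in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def global_search(graph: Graph, query: str, k: int = 5) -> list:
    """Rank every node in the graph by its text descriptor."""
    scored = [(overlap_score(query, text), nid) for nid, (_, text) in graph.nodes.items()]
    scored = [(s, nid) for s, nid in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [nid for _, nid in scored[:k]]


def explore_neighborhood(graph, node_id, node_type=None, edge_type=None, query=None, k=5):
    """Return one-hop neighbors, optionally filtered by type and ranked by a subquery."""
    hits = []
    for e_type, nbr in graph.edges.get(node_id, []):
        if edge_type is not None and e_type != edge_type:
            continue
        if node_type is not None and graph.nodes[nbr][0] != node_type:
            continue
        hits.append(nbr)
    if query is not None:
        # sorted() is stable, so ties keep adjacency order.
        hits = sorted(hits, key=lambda nid: -overlap_score(query, graph.nodes[nid][1]))
    return hits[:k]


g = Graph()
g.add_node("P1", "paper", "adaptive graph retrieval with language agents")
g.add_node("P2", "paper", "a survey of dense retrieval")
g.add_node("A1", "author", "Ada Lovelace")
g.add_node("A2", "author", "Alan Turing")
g.add_edge("P1", "authored_by", "A1")
g.add_edge("P1", "authored_by", "A2")
g.add_edge("P1", "cites", "P2")

print(global_search(g, "graph retrieval agents"))
print(explore_neighborhood(g, "P1", node_type="author", query="turing"))
```

Each call to `explore_neighborhood` is a single hop. Multi-hop behavior only appears when a caller feeds one call's output into the next, which is the point of the design.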

This creates a retrieval loop:

Find candidate nodes globally.
Select promising nodes.
Explore their neighborhoods if the query needs relations.
Re-anchor globally if the current path becomes unhelpful.
Finish when the ranked list is good enough or the step budget is exhausted.
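
The loop above can be sketched as a small driver in which a policy picks the next action from the history so far. This is an illustration under stated assumptions, not the paper's code: `run_trajectory`, the scripted policy (standing in for the LLM), the stub tools, and the fallback ranking when the budget runs out are all hypothetical.

```python
def run_trajectory(policy, tools, query, max_steps=20):
    """Run one retrieval trajectory until the policy finishes or the budget is spent."""
    history = []  # list of (action, args, observation)
    for _ in range(max_steps):
        action, args = policy(query, history)
        if action == "finish":
            return args["ranked"], history
        observation = tools[action](**args)
        history.append((action, args, observation))
    # Budget exhausted: fall back to every node seen, in first-seen order (an assumption).
    seen = [node for _, _, obs in history for node in obs]
    return list(dict.fromkeys(seen)), history


# Stub tools backed by lookup tables, standing in for real graph access.
INDEX = {"graph retrieval": ["P1", "P2"]}
NEIGHBORS = {("P1", "author"): ["A1", "A2"]}
tools = {
    "global_search": lambda query: INDEX.get(query, []),
    "explore_neighborhood": lambda node_id, node_type: NEIGHBORS.get((node_id, node_type), []),
}


def scripted_policy(query, history):
    """Stand-in for the LLM: search globally, expand the top hit, then finish."""
    if not history:
        return "global_search", {"query": "graph retrieval"}
    if len(history) == 1:
        anchor = history[0][2][0]
        return "explore_neighborhood", {"node_id": anchor, "node_type": "author"}
    return "finish", {"ranked": history[-1][2]}


ranked, history = run_trajectory(scripted_policy, tools, "who wrote the graph retrieval paper?")
print(ranked)
print([action for action, _, _ in history])
```

The driver never hard-codes a hop count. Depth is whatever the policy chooses before it emits `finish` or hits `max_steps`.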

For enterprise teams, this is the first useful lesson: graph retrieval should be designed as navigation under uncertainty, not as a one-shot ranking function. A graph is not valuable merely because it stores relations. It is valuable when the retrieval system knows when to use those relations.

The benchmark result is strong, but the pattern matters more than the headline

ARK is evaluated on STaRK, a benchmark for entity-level retrieval over heterogeneous, text-rich knowledge graphs. The three graphs are meaningfully different: AMAZON is an e-commerce graph with roughly 1.0 million entities and 9.4 million relations; MAG is a scholarly graph with about 1.9 million entities and 39.8 million relations; PRIME is a biomedical graph with about 129,000 entities and 8.1 million relations.

That diversity matters because a retriever that performs well only on one graph style is less interesting. Enterprise graphs are rarely tidy. Product catalogs, research databases, customer graphs, supplier networks, and compliance knowledge bases all have different ratios of text, entity types, relation density, and ambiguity.

The main benchmark table reports ARK’s average performance across the three STaRK datasets as 59.14 Hit@1, 71.51 Recall@20, and 67.44 MRR. Among training-free methods, that is the best average result in the paper. On average Hit@1 and MRR it also beats several methods that require training on the target graph, though it does not win every metric on every dataset.
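
For readers less familiar with the three metrics, here is a minimal sketch using their standard per-query definitions. The function names and example node IDs are hypothetical, and the benchmark's own evaluation code should be treated as the reference for edge cases; reported scores are averages over queries, expressed as percentages.

```python
def hit_at_k(ranked, relevant, k=1):
    """1.0 if any relevant node appears in the top k, else 0.0."""
    return 1.0 if any(node in relevant for node in ranked[:k]) else 0.0


def recall_at_k(ranked, relevant, k=20):
    """Fraction of the relevant nodes that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant node, or 0.0 if none is retrieved."""
    for rank, node in enumerate(ranked, start=1):
        if node in relevant:
            return 1.0 / rank
    return 0.0


ranked = ["n7", "n2", "n9"]
relevant = {"n2", "n4"}
print(hit_at_k(ranked, relevant, k=1))      # 0.0: the top result is not relevant
print(recall_at_k(ranked, relevant, k=20))  # 0.5: one of two relevant nodes found
print(reciprocal_rank(ranked, relevant))    # 0.5: first relevant node is at rank 2
```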

A compressed view:

Method category	Method	Average Hit@1	Average Recall@20	Average MRR	Interpretation
Training-free lexical / dense baselines	BM25, ada-002, GritLM-7B	26.96–31.85	43.57–47.34	36.68–41.61	Broad retrieval alone is useful but shallow
Training-free graph-aware baseline	KAR	45.01	56.11	52.67	Structural hints help, but fixed retrieval remains limited
Training-free agent baselines	Think-on-Graph variants	12.71–20.22	14.41–42.73	14.84–31.43	Traversal can fail when anchoring is brittle
ARK	ARK	59.14	71.51	67.44	Adaptive breadth-depth control is the main advantage
Requires target-graph training	mFAR, MoR, GraphFlow, AvaTaR	37.55–49.63	50.18–71.00	45.53–60.20	Training helps, but does not automatically solve breadth-depth control
Distilled ARK	ARK distilled	49.51	66.31	58.47	Teacher trajectories transfer much of the behavior to a smaller model

The most revealing dataset is MAG. ARK reaches 73.40 Hit@1 and 79.87 MRR, substantially above the other methods reported in the table. On PRIME, GraphFlow has the best Hit@1, but ARK remains strong without reinforcement learning on the target graph. On AMAZON, classical and graph-aware retrieval baselines are already competitive because many queries are descriptive, but ARK still leads the training-free category.

So the paper is not merely saying “agents beat retrievers.” That would be too easy and, frankly, too LinkedIn. The actual message is narrower and more useful: an agent with the right retrieval interface can coordinate global search and relational traversal better than either static retrieval or traversal-only agents.

Figure 3 is the mechanism check, not a decorative chart

The paper’s Figure 3 is important because it tests whether ARK’s behavior matches the type of query it faces. STaRK provides proportions of text-centric versus relation-centric queries. ARK does not receive those labels, but the authors compare them with the tool calls ARK actually makes.

On AMAZON, ARK uses Global Search for 87.7% of tool calls. That fits the dataset’s text-heavy query profile. On MAG and PRIME, where relational retrieval matters more, ARK shifts toward Neighborhood Exploration: 65.3% of tool calls on MAG and 52.3% on PRIME.

This is not main performance evidence in the same sense as Table 1. It is a behavioral alignment check. It asks whether the system’s internal action pattern makes sense relative to the task regime.

That distinction matters. If ARK had produced strong scores while using roughly the same tool pattern everywhere, we would suspect the benchmark win came from the backbone model or some incidental ranking advantage. Instead, the tool-use distribution supports the mechanism claim: the agent changes how it searches depending on what the query appears to require.

For business readers, this is the difference between a system that has “graph support” and a system that actually knows when graph support is useful. Many vendors can connect a graph database to a chatbot. Fewer can show that the retrieval process behaves differently for descriptive, relational, and mixed queries.

The ablation says the neighborhood tool is not optional

The ablation in Table 2 is the paper’s cleanest technical evidence for why the two-tool design matters. The authors test variants of the same ARK pipeline while removing or weakening parts of Neighborhood Exploration.

The result is blunt. Removing Neighborhood Exploration costs AMAZON only a few points, but it devastates the relational datasets, MAG and PRIME. Hit@1 on a 10% test subset:

Setup	AMAZON Hit@1	MAG Hit@1	PRIME Hit@1
Full ARK	58.5	79.2	49.2
Without Neighborhood Exploration	54.5	30.5	23.1
Neighborhood without query ranking	56.0	72.1	44.7
Neighborhood without type filtering	55.5	79.2	42.2

The table isolates which tool design choices explain performance. It supports three points.

First, local graph traversal is essential for multi-hop settings. Removing it collapses MAG and PRIME performance. This is the most direct evidence that ARK is not just BM25 with better prompting.

Second, ranking neighbors by an agent-generated query helps. Without query-based ranking, performance drops, especially on MAG. This suggests that even after the system enters a local neighborhood, it still needs textual discrimination to avoid high-degree noise.

Third, type filtering matters in heterogeneous graphs. PRIME has more node and relation types, and disabling type constraints damages it more visibly. In business terms: a graph retriever should know the difference between following a supplier relation, a synonym relation, a regulatory dependency, a citation, and a transaction. Otherwise, every edge becomes an invitation to wander.

This is where ARK becomes practically interesting. It is not proposing “more graph.” It is proposing controlled graph movement.

Compute is a dial, not a moral failing

Agentic retrieval costs more than single-pass retrieval. The paper does not hide this. It studies the tradeoff directly.

Figure 2 varies the maximum trajectory length and the number of parallel agents. The likely purpose is a sensitivity and compute-performance test. It shows that higher budgets generally improve Hit@1 and Recall@20, especially on MAG and PRIME, but also increase latency.

The numbers illustrate the shape of the tradeoff. On MAG, one agent with five steps reaches 46.9 Hit@1; three agents with twenty steps reach 73.4 Hit@1. Latency rises from 6.86 seconds to 13.63 seconds. On PRIME, the same movement goes from 25.8 Hit@1 to 48.3 Hit@1, with latency rising from 7.16 seconds to 12.74 seconds. AMAZON improves less because text search already solves more of the workload.

This is not a universal argument for throwing more agents at every query. It is a map of operating points.

A compliance assistant answering a high-risk regulatory question may tolerate a 12-second retrieval path if it reduces wrong evidence. A product-search assistant in a consumer app may not. A biomedical research assistant may need depth. A call-center agent may need fast lexical lookup most of the time, with deeper retrieval only for escalations.

The business design pattern is obvious but often ignored: retrieval budgets should be query- and risk-dependent. Not every question deserves the same graph walk.
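
One way to act on this is to treat measured configurations as a menu and pick the strongest one that fits a query's latency allowance. The sketch below uses the two MAG operating points quoted above; the `OperatingPoint` type, the selection rule, and the latency budgets are hypothetical, and a real deployment would measure its own points.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OperatingPoint:
    agents: int
    max_steps: int
    latency_s: float
    hit_at_1: float


# The two MAG configurations quoted in the text above.
MAG_POINTS = [
    OperatingPoint(agents=1, max_steps=5, latency_s=6.86, hit_at_1=46.9),
    OperatingPoint(agents=3, max_steps=20, latency_s=13.63, hit_at_1=73.4),
]


def pick_operating_point(points, latency_budget_s) -> Optional[OperatingPoint]:
    """Return the highest-Hit@1 point that fits the latency budget, if any."""
    feasible = [p for p in points if p.latency_s <= latency_budget_s]
    return max(feasible, key=lambda p: p.hit_at_1) if feasible else None


print(pick_operating_point(MAG_POINTS, latency_budget_s=8.0))   # the 1-agent, 5-step point
print(pick_operating_point(MAG_POINTS, latency_budget_s=15.0))  # the 3-agent, 20-step point
print(pick_operating_point(MAG_POINTS, latency_budget_s=5.0))   # None: nothing fits
```

When nothing fits the budget, the caller has to decide what to do instead, for example falling back to a single global search.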

The appendix adds a useful detail. Under the same total step budget of 30 steps, three shorter parallel trajectories outperform one longer trajectory across all three graphs. That is a robustness test for the parallel design. It suggests the gain is not merely “more total steps.” Multiple independent attempts produce diversity, and the aggregation rule extracts a consensus signal.

Voting works because retrieval confidence is partly social

ARK aggregates parallel agents by ranking nodes according to how many agents selected them, breaking ties by earliest occurrence. Table 3 compares this voting rule with simple ordering and random merging.
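
A minimal sketch of that aggregation rule follows. The function name is hypothetical, and "earliest occurrence" is interpreted here as the best rank position a node reaches in any agent's list, which is an assumption about the paper's exact tie-break.

```python
from collections import Counter


def aggregate_by_votes(runs):
    """Merge ranked lists: more votes first, then earliest rank position."""
    votes = Counter()
    best_rank = {}
    for run in runs:
        # dict.fromkeys drops repeats so each agent casts one vote per node.
        for rank, node in enumerate(dict.fromkeys(run)):
            votes[node] += 1
            best_rank[node] = min(rank, best_rank.get(node, rank))
    # sorted() is stable, so any remaining ties keep first-seen order.
    return sorted(votes, key=lambda node: (-votes[node], best_rank[node]))


runs = [
    ["a", "b", "c"],  # agent 1
    ["b", "a", "d"],  # agent 2
    ["b", "e"],       # agent 3
]
print(aggregate_by_votes(runs))  # ['b', 'a', 'e', 'c', 'd']
```

Here `b` wins with three votes, `a` follows with two, and the single-vote nodes are ordered by how early each appeared in its own list.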

Voting wins on all three datasets. The largest gains are on MAG and PRIME, where relational retrieval is harder. Random merging keeps Recall@20 surprisingly high in some cases because the correct node may still appear somewhere in the union, but Hit@1 collapses. That distinction matters. In downstream RAG, the top-ranked evidence often shapes the generated answer. Retrieval buried at rank 17 is not as useful as retrieval at rank 1, unless the downstream generator is exceptionally disciplined. It usually is not. We have all met generators.

The likely purpose of Table 3 is a component comparison: once multiple trajectories exist, how should the system merge them? The result supports the idea that independent explorations provide a signal about stable relevance. If several agents reach the same node through different paths or subqueries, the node deserves priority.

For enterprise systems, this creates a practical diagnostic. If three retrieval trajectories disagree completely, the answer should probably be treated as lower-confidence. If they converge, the system has a stronger basis for answer generation, citation, or escalation.
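
That diagnostic can be approximated with a simple agreement score. The sketch below uses the mean pairwise Jaccard overlap of each trajectory's top-k nodes; this measure, the function name, and any threshold placed on it are illustrative choices, not something the paper proposes.

```python
from itertools import combinations


def trajectory_agreement(runs, k=5):
    """Mean pairwise Jaccard overlap of the top-k nodes across trajectories."""
    tops = [set(run[:k]) for run in runs]
    pairs = list(combinations(tops, 2))
    if not pairs:
        return 1.0  # a single trajectory trivially agrees with itself
    scores = []
    for a, b in pairs:
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


converging = [["n1", "n2"], ["n2", "n1"], ["n1", "n2"]]
diverging = [["n1", "n2"], ["n3", "n4"], ["n5", "n6"]]
print(trajectory_agreement(converging))  # 1.0
print(trajectory_agreement(diverging))   # 0.0
```

A low score is a prompt for caution, such as escalation or a hedged answer, rather than proof that the retrieval is wrong.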

BM25 survives because the agent can reformulate and traverse

One of the paper’s more quietly useful findings appears in Table 4. The authors compare BM25 and dense retrieval inside ARK’s relevance function. Dense retrieval uses text-embedding-3-large.

The result is not the expected “dense wins because embeddings are modern.” BM25 performs better on AMAZON and PRIME Hit@1, while dense retrieval is slightly better on MAG. Overall, BM25 remains very competitive inside the agentic loop.

This is not a general anti-embedding conclusion. It is a relevance-function comparison inside a specific iterative system. The reason is plausible: the agent does not issue one query and accept the result. It can refine subqueries over time, use graph relations to discover nodes that text similarity misses, and rank local neighborhoods where lexical cues may be enough. Dense retrieval’s advantage in one-shot semantic matching does not automatically transfer to a multi-step search process.

For business architecture, this is a useful antidote to embedding maximalism. The retrieval stack should not be chosen by fashion. In an agentic graph retriever, the interface, action policy, filters, and stopping behavior may matter as much as the similarity function.

Sometimes the expensive magic vector is not the bottleneck. Painful, I know.

Distillation turns a retrieval behavior into a deployable policy

ARK’s best configuration uses a large proprietary model as the backbone. That is not ideal for cost-sensitive deployments. The paper therefore distills the teacher’s tool-use trajectories into smaller Qwen3 models.

This is not ordinary supervised retrieval training using ground-truth evidence labels. The student imitates the teacher’s interaction traces: tool calls, parameters, and decisions, with tool observations included in the trajectory and loss applied only to assistant-authored tokens. The authors use GPT-4.1 as the teacher, collect three trajectories per training query with stochastic decoding, and fine-tune Qwen3-8B using LoRA. The full 6,000-query-per-graph setting uses up to 18,000 trajectories per graph and 94.4 million tokens, with training completed in about five hours on a single H100.
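
Applying loss only to assistant-authored tokens is usually done by masking the labels of every other segment. The sketch below shows the idea with plain lists; the role names, the token IDs, and the `build_labels` helper are hypothetical, and `-100` is used because it is the default `ignore_index` of PyTorch's cross-entropy loss. The one-position shift for next-token prediction is left to the training framework.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(segments):
    """Flatten (role, token_ids) segments and mask non-assistant tokens.

    Tool observations and user text stay in the input as context, but only
    assistant tokens (tool calls, parameters, decisions) contribute to the loss.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


trajectory = [
    ("user", [11, 12]),           # the query
    ("assistant", [21, 22, 23]),  # a tool call written by the model
    ("tool", [31]),               # the tool observation
    ("assistant", [41]),          # the final decision
]
input_ids, labels = build_labels(trajectory)
print(input_ids)  # [11, 12, 21, 22, 23, 31, 41]
print(labels)     # [-100, -100, 21, 22, 23, -100, 41]
```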

The main result: distilled ARK reaches 49.51 average Hit@1, 66.31 Recall@20, and 58.47 MRR. It does not match the teacher, but it preserves a substantial fraction of the behavior. The abstract reports absolute Hit@1 gains over the base 8B model of +7.0 on AMAZON, +26.6 on MAG, and +13.5 on PRIME.

Figure 4 and Table 5 together serve as a distillation and budget analysis. They show that more trajectory data helps, that the 8B student generally benefits more than the 4B student in difficult settings, and that PRIME remains the hardest dataset. The paper notes that using 10% of the trajectories recovers roughly half of the total improvement achieved with the full training set, and that the 600-query setting can be distilled in about 30 minutes on a single H100.

For business use, the implication is not “fine-tuning solves cost.” The implication is more specific: tool-use behavior can be taught from traces even when relevance labels are unavailable. That is valuable because many enterprise graphs do not have clean labeled retrieval datasets. They do have queries, logs, expert workflows, and the ability to run expensive teacher systems offline.

This suggests a staged deployment path:

Stage	Practical action	Business purpose
Prototype	Use a strong teacher LLM with ARK-style tools on a target graph	Validate whether adaptive retrieval improves evidence quality
Logging	Save trajectories, tool calls, selected nodes, and failures	Build a training and audit asset without manual labeling
Distillation	Fine-tune a smaller model to imitate successful retrieval behavior	Reduce cost and latency
Routing	Use the distilled model for routine queries; reserve the teacher for hard or high-risk cases	Control unit economics without flattening quality
Audit	Track disagreement, path drift, and sensitive attribute exposure	Prevent graph retrieval from becoming confident nonsense

That is a real system design, not a demo script with edges.

Successful retrieval is selective, not merely deeper

Figure 5 examines the number of Neighborhood Exploration calls in successful versus unsuccessful trajectories on MAG and PRIME. Its likely purpose is diagnostic error analysis.

The result is subtle. Failed runs show two opposite failure modes. Some make no neighborhood calls at all, meaning the agent never realizes that relational evidence is needed. Others make too many neighborhood calls, suggesting drift through high-branching parts of the graph. Successful trajectories use neighborhood exploration sparingly and rarely exceed ten calls.

This matters because it corrects a common misconception: if graph retrieval fails, just expand deeper. No. Deeper traversal can be the problem. A graph is full of plausible distractions. Expanding more aggressively increases the chance of encountering semantically adjacent but answer-irrelevant nodes.

The better replacement belief is: good graph retrieval requires selective expansion and disciplined stopping.

For enterprise teams, this becomes an evaluation criterion. Do not only ask whether the retriever can perform multi-hop traversal. Ask whether it knows when not to.

What the paper directly shows, and what business readers should infer

The paper directly shows that ARK performs strongly on STaRK’s text-rich heterogeneous knowledge-graph retrieval tasks, especially compared with training-free baselines. It shows that the two-tool design matters through ablation. It shows that ARK’s tool use shifts with dataset query profile. It shows that compute budgets and parallel agents create measurable quality-latency tradeoffs. It shows that teacher trajectories can be distilled into a smaller Qwen3-8B model without using labeled relevance nodes.

Business readers should infer a broader but bounded lesson: enterprise RAG over structured knowledge should be treated as adaptive evidence navigation. A good system should not merely retrieve chunks, nor blindly traverse graph paths. It should expose a small set of safe retrieval actions, let the model allocate breadth and depth per query, and record the trajectory so the behavior can be audited, improved, and possibly distilled.

This is especially relevant for:

Enterprise graph	Why adaptive retrieval helps	What to watch
Product catalogs	Descriptive search and relation-based discovery coexist	Synonyms, variants, and sparse product metadata can break lexical anchoring
Research and patent graphs	Queries often require author, citation, method, and topic paths	Graph density can create distracting neighborhoods
Biomedical knowledge bases	Important evidence may sit across drugs, genes, diseases, phenotypes, and papers	Sensitive or biased graph content requires strict governance
Compliance knowledge systems	Regulations, entities, obligations, and exceptions form relational chains	Wrong retrieval may be treated as authoritative evidence
Client or supplier graphs	Business questions often combine entity identity, ownership, transactions, and risk signals	Privacy, access control, and audit logging are non-negotiable

The inference boundary is equally important. ARK is tested on text-rich graphs. It may not transfer cleanly to sparse graphs with weak node descriptions. Its global search is lexical, so aliases, paraphrases, multilingual naming, and domain-specific vocabulary mismatches can hurt retrieval. The best configuration uses a large proprietary model, and agentic retrieval has higher latency than single-pass retrieval. Distillation helps, but the student still trails the teacher in harder regimes.

These are not fatal flaws. They are deployment constraints.

The enterprise version needs governance around the graph, not just around the answer

ARK’s ethical section is short but operationally meaningful. Agentic graph exploration can surface sensitive attributes, amplify biases in the graph, or cause retrieval errors to be treated as evidence. In a normal RAG system, the failure mode is often a bad chunk. In graph retrieval, the failure mode can be a bad path: the system may connect entities in a way that looks structured and therefore credible.

That credibility is dangerous.

A business implementation should therefore log not only the final evidence but also the route: which global searches were issued, which nodes were selected, which neighborhoods were explored, which filters were applied, and where the system stopped. This is not bureaucratic decoration. It is how teams debug retrieval behavior, detect drift, and decide whether a retrieved path is acceptable for customer-facing or regulated use.

The design should also enforce access controls at the tool level. The agent should not be able to explore sensitive node types merely because they are adjacent. Type filters should be permissions as well as retrieval aids.
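
A minimal sketch of what "type filters as permissions" could look like inside the neighborhood tool follows. The function name, the node types, and the choice to raise `PermissionError` on a forbidden request are hypothetical design choices, not part of ARK.

```python
def permitted_neighbors(neighbors, allowed_types, requested_type=None):
    """Filter one-hop neighbors by caller permissions before the agent sees them.

    neighbors: list of (node_id, node_type) pairs adjacent to the current node.
    allowed_types: node types this caller is permitted to read.
    requested_type: optional type filter chosen by the agent.
    """
    if requested_type is not None and requested_type not in allowed_types:
        # Refuse explicitly rather than silently returning nothing.
        raise PermissionError(f"node type not permitted: {requested_type}")
    return [
        node_id
        for node_id, node_type in neighbors
        if node_type in allowed_types
        and (requested_type is None or node_type == requested_type)
    ]


neighbors = [("c1", "company"), ("h9", "health_record"), ("t3", "transaction")]
allowed = {"company", "transaction"}

print(permitted_neighbors(neighbors, allowed))                 # ['c1', 't3']
print(permitted_neighbors(neighbors, allowed, "transaction"))  # ['t3']
```

Because the check lives in the tool rather than in the prompt, an adjacent sensitive node never enters the agent's context, whatever the model decides to ask for.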

The quiet lesson: retrieval is becoming an operating system

ARK’s most useful idea is not “LLMs can use graph tools.” We knew that. The useful idea is that retrieval itself is becoming an operating process: a sequence of actions, observations, decisions, budgets, and stopping rules.

That changes how companies should evaluate RAG systems.

The old question was: Which retriever gives the best top-$k$ results?

The better question is: Which retrieval policy knows how to move through the knowledge environment?

ARK answers that question with a minimal mechanism. Global search keeps the system from becoming locally trapped. Neighborhood exploration gives it relational depth. Parallel trajectories add robustness. Voting extracts consensus. Distillation makes the behavior cheaper. The result is not perfect, and the paper is clear about latency, model-size, lexical, and text-rich-graph boundaries. But the operating principle is strong.

Graph RAG should not be a static index with a prettier schema. It should breathe: wide when the query is broad, deep when the evidence is relational, and still enough to stop when the answer has been found.

That is not a bigger graph. It is a better retrieval metabolism.

Cognaptus: Automate the Present, Incubate the Future.

¹ Joaquín Polonuer, Lucas Vittor, Iñaki Arango, Ayush Noori, David A. Clifton, Luciano Del Corro, and Marinka Zitnik, “Autonomous Knowledge Graph Exploration with Adaptive Breadth-Depth Retrieval,” arXiv:2601.13969v2, 2026, https://arxiv.org/abs/2601.13969.
