TL;DR for operators

A user asks an AI agent to delete an account. The obvious tool is DeleteAccount. A normal semantic retriever will probably find it. Splendid. The agent still fails if it misses GetUserToken, because the deletion tool needs a token first. This is the failure mode Tool Graph Retriever, or TGR, is built to address.[1]

The paper’s core argument is simple but operationally sharp: tool retrieval should not treat each tool as an isolated document. In real workflows, tools depend on other tools. Some dependencies are result-based, where one tool needs another tool’s output. Others are verification-based, where one tool must authenticate, validate, or check something before another tool can run. These prerequisite tools are often semantically boring. They do not mention the user’s end goal. That is exactly why similarity search tends to miss them.

TGR adds dependency structure to retrieval. It first trains a discriminator to identify whether tool A depends on tool B, tool B depends on tool A, or no dependency exists. It uses that discriminator to build a directed dependency graph over candidate tools. Then it applies a simplified graph convolution step to update tool embeddings before online retrieval. The retrieval score is still query-to-tool similarity, but the tool representation now carries information from its dependency neighbourhood.

The evidence is benchmark-based, not production telemetry. On API-Bank, adding TGR to ToolBench-IR raises Pass Rate@10 from 0.624 to 0.788. On ToolBench-I1, the same pairing raises Pass Rate@10 from 0.690 to 0.730. The gains are consistent across Recall, NDCG, and Pass Rate, and the paper’s density analysis suggests TGR helps more where tool graphs are more interconnected. Translation: dependency-aware retrieval is most relevant where workflows are modular, multi-step, and unforgiving. Finance, identity, enterprise SaaS, compliance operations, logistics, and customer support may now raise a hand quietly.

The catch is also clear. TGR depends on the quality of the dependency graph. Manual dependency graphs outperform discriminator-built graphs in the API-Bank analysis. Graph construction has $O(N^2)$ time complexity, because every pair of tools must be checked for a dependency, and that gets expensive fast as the catalogue grows. So the business question is not “Should we use graphs?” but “Which tool domains are dependency-dense enough to justify graph construction, and how will we maintain dependency metadata without creating yet another internal archaeology project?”

The obvious tool is not always the usable tool

Tool retrieval sounds like a search problem until the agent actually has to execute something.

A user query says: “Update my email.” A semantic retriever sees UpdateEmail and beams proudly. It may not retrieve ValidateCredentials. It may not retrieve Login. From the retriever’s perspective, these tools are not very similar to the query. From the workflow’s perspective, they are not optional. The agent has retrieved the action without the permission to perform it. That is not intelligence. That is a very confident door handle without a key.

This is the misconception the paper usefully attacks: better semantic embeddings do not automatically solve tool retrieval. Semantic relevance and execution necessity are different things. The final action may be semantically obvious. The prerequisite may be semantically invisible.

In enterprise systems, this distinction is everywhere:

| User-facing task | Obvious action tool | Easily missed prerequisite |
| --- | --- | --- |
| Update customer address | UpdateCustomerRecord | AuthenticateUser, ValidateAddress, FetchCustomerID |
| Book shipment | CreateShipment | QuoteRate, CheckCoverage, ReserveInventory |
| Process refund | IssueRefund | VerifyOrder, CheckRefundPolicy, GetPaymentToken |
| Change employee role | UpdateRole | CheckApprover, FetchPolicy, AuditPermission |
| Close account | DeleteAccount | GetUserToken, ConfirmOwnership, ArchiveData |

The paper’s mechanism-first value is that it identifies the missing object. The issue is not merely “retrieval quality.” The issue is retrieval completeness under dependency constraints.

A tool can be relevant in at least two ways. It can match the user’s intent directly. Or it can be required so that another relevant tool can run. TGR is built around the second category.

TGR turns prerequisite knowledge into retrieval structure

The paper defines a dependency between two tools in two cases.

First, a tool may require another tool’s result as input. If DeleteAccount needs a user token, and GetUserToken produces that token, then DeleteAccount depends on GetUserToken.

Second, a tool may require another tool for prior verification. If Login requires ValidateCredentials, then Login depends on ValidateCredentials.

That gives the system a directed relationship:

Action tool → prerequisite tool
DeleteAccount → GetUserToken
Login → ValidateCredentials
UpdateEmail → Login

This direction matters. A dependency graph is not just a cluster of related APIs. It encodes execution preconditions. Confusing “relatedness” with “dependency” would turn the method into a fancy synonym engine, which would be more expensive and only slightly more decorative.
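To make the point about direction concrete, here is a minimal sketch that stores the three example edges as a directed graph and walks them to collect everything a tool needs before it can run. This is plain graph traversal for illustration, not TGR's own mechanism (which uses graph convolution over embeddings); the `DEPENDS_ON` dictionary and the `prerequisites` helper are hypothetical names.

```python
from collections import deque

# Directed edges: action tool -> prerequisite tools (taken from the examples above).
DEPENDS_ON = {
    "DeleteAccount": ["GetUserToken"],
    "UpdateEmail": ["Login"],
    "Login": ["ValidateCredentials"],
}

def prerequisites(tool, graph=DEPENDS_ON):
    """Return every tool reachable by following dependency edges from `tool`."""
    seen = []
    queue = deque(graph.get(tool, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already collected, avoids loops on cyclic graphs
        seen.append(current)
        queue.extend(graph.get(current, []))
    return seen

print(prerequisites("UpdateEmail"))          # ['Login', 'ValidateCredentials']
print(prerequisites("ValidateCredentials"))  # []
```

The asymmetry is the whole point: UpdateEmail pulls in two prerequisites, while ValidateCredentials pulls in none. An undirected "related tools" cluster would lose that distinction.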

TGR has three stages:

  1. Dependency identification: train a model to classify tool-pair relationships.
  2. Graph-based tool encoding: represent tools as nodes and dependency relationships as directed edges.
  3. Online retrieval: retrieve tools using dependency-enriched embeddings.

The online stage still looks familiar. A user query is embedded, tool embeddings are compared, and top-$k$ tools are returned. The important difference is upstream: the tool vectors have already been shaped by dependency information.
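The online stage can be sketched in a few lines. The example below assumes cosine similarity as the scoring function and uses invented three-dimensional vectors; the tool names come from the article, while `GetWeather`, the vectors, and the helper names are placeholders rather than anything from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, tool_vecs, k):
    """Rank tool names by similarity to the query and return the best k."""
    ranked = sorted(tool_vecs, key=lambda name: cosine(query_vec, tool_vecs[name]), reverse=True)
    return ranked[:k]

# Toy embeddings, invented for illustration only.
tool_vecs = {
    "DeleteAccount": [0.9, 0.1, 0.0],
    "GetUserToken": [0.2, 0.9, 0.1],
    "GetWeather": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]  # stands in for "delete my account"

print(top_k(query, tool_vecs, k=1))  # ['DeleteAccount']
print(top_k(query, tool_vecs, k=2))  # ['DeleteAccount', 'GetUserToken']
```

Nothing in this function knows about dependencies. Whatever TGR contributes has to arrive through the contents of `tool_vecs` before this code runs.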

The simplified flow is:

Tool documents
→ Dependency discriminator
→ Directed tool graph
→ Graph convolution over tool embeddings
→ Dependency-aware retrieval

That design choice is quietly practical. TGR does not ask the LLM planner to rediscover every prerequisite during execution. It tries to make the retriever hand the planner a more complete working set from the beginning.

The discriminator is the graph’s weakest and most necessary organ

Dependency graphs do not appear by magic, unless one works in vendor demos, where everything appears by magic and then invoices you.

The authors build TDI300K, a dataset for tool dependency identification. The task is a three-class classification problem over a pair of tools:

| Class | Meaning |
| --- | --- |
| $t_a \rightarrow t_b$ | Tool $t_a$ depends on tool $t_b$ |
| $t_a \times t_b$ | No dependency exists |
| $t_a \leftarrow t_b$ | Tool $t_b$ depends on tool $t_a$ |
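The three-class framing maps naturally onto a pairwise loop that turns labels into directed edges. The sketch below is an illustration only: `toy_discriminator` is a hypothetical rule-based stand-in for the paper's trained classifier, and the `inputs`/`outputs` fields are an assumed document schema, not the format the authors use.

```python
from itertools import combinations

# The three labels from the table above.
A_DEPENDS_ON_B, NO_DEPENDENCY, B_DEPENDS_ON_A = "a->b", "none", "a<-b"

def toy_discriminator(doc_a, doc_b):
    """Hypothetical stand-in for the trained model: a tool depends on
    another when one of its inputs is produced by that other tool."""
    if set(doc_a["inputs"]) & set(doc_b["outputs"]):
        return A_DEPENDS_ON_B
    if set(doc_b["inputs"]) & set(doc_a["outputs"]):
        return B_DEPENDS_ON_A
    return NO_DEPENDENCY

def build_graph(tool_docs, classify=toy_discriminator):
    """Classify each unordered pair once; collect (dependent, prerequisite) edges."""
    edges = set()
    for a, b in combinations(sorted(tool_docs), 2):
        label = classify(tool_docs[a], tool_docs[b])
        if label == A_DEPENDS_ON_B:
            edges.add((a, b))
        elif label == B_DEPENDS_ON_A:
            edges.add((b, a))
    return edges

tool_docs = {
    "DeleteAccount": {"inputs": ["token"], "outputs": ["status"]},
    "GetUserToken": {"inputs": ["username", "password"], "outputs": ["token"]},
    "GetWeather": {"inputs": ["city"], "outputs": ["forecast"]},
}
print(build_graph(tool_docs))  # {('DeleteAccount', 'GetUserToken')}
```

Because one call returns either direction or no dependency, each unordered pair needs only a single classification.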

The paper uses a two-stage construction and training strategy.

The pretraining portion is derived from CodeSearchNet. The authors use LLM-based agents to extract tool-style documentation from real function implementations, generate dependent tool documentation, and validate whether the dependency satisfies their criteria. This produces a balanced pretraining set: 92,000 examples in each of the three classes.

The finetuning portion is more realistic and more awkward, as reality likes to be. It is manually constructed from real function tools collected from open-source datasets, projects, and libraries. Its class distribution is heavily imbalanced: 1,029 examples for one dependency direction, 33,365 no-dependency examples, and 1,056 for the opposite direction. That imbalance reflects the real sparsity of dependencies. Most tool pairs do not depend on each other. Most API catalogues are not beautifully organised dependency poems.
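A quick calculation shows how lopsided the finetuning set is. The class counts are the ones quoted above; the inverse-frequency weights are a common heuristic for imbalanced classification, included here as an assumption about how one might compensate, not as something the paper is reported to do.

```python
# Finetuning class counts as quoted in the text.
counts = {"a->b": 1029, "none": 33365, "a<-b": 1056}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# "Balanced" inverse-frequency weights: total / (num_classes * class_count).
# Assumption: a generic imbalance remedy, not the authors' stated recipe.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"{label:5s} share={shares[label]:.1%} weight={weights[label]:.2f}")
```

The no-dependency class accounts for roughly 94% of the 35,450 pairs, so a classifier that always answers "no dependency" would look accurate while being useless for graph construction.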

The discriminator is BERT-base-uncased. The paper reports validation precision of 0.775, recall of 0.814, and F1 of 0.792. On the manually annotated API-Bank-derived test set, precision is 0.893, recall is 0.760, and F1 is 0.817.

Those numbers are good enough to make the method work in the benchmark. They are not good enough to make graph quality disappear as an implementation concern.

That distinction matters for operators. In a live enterprise system, a false positive dependency can pull irrelevant tools into context. A false negative can preserve the original failure mode by omitting the prerequisite. Dependency identification is therefore not a housekeeping detail. It is the load-bearing step.

Graph convolution makes boring prerequisites visible

Once TGR has a dependency graph, it updates tool embeddings using graph convolution. The graph is directed, with tools as nodes and dependencies as edges. If $t_a$ depends on $t_b$, the graph connects $t_a$ to $t_b$.

The paper uses a parameter-free graph convolution variant:

$$ G(X, A) = D^{-\frac{1}{2}}(A + I)D^{-\frac{1}{2}}X $$

Here, $X$ is the tool embedding matrix, $A$ is the adjacency matrix, $I$ adds self-connections, and $D$ is the degree matrix. The authors remove trainable GCN parameters to speed the retrieval process.
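A small worked example makes the formula less abstract. The sketch below applies $D^{-\frac{1}{2}}(A + I)D^{-\frac{1}{2}}X$ to three toy tools using plain Python lists. Several things here are assumptions rather than details from the paper: $D$ is taken as the row sums of $A + I$, the adjacency is oriented so that a prerequisite absorbs the embedding of the tool that depends on it, and the one-hot embeddings, `GetWeather`, and the query vector are invented.

```python
import math

def graph_convolve(X, A):
    """Compute D^{-1/2} (A + I) D^{-1/2} X.
    Assumption: D holds the row sums of A + I."""
    n, dim = len(X), len(X[0])
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in A_hat]
    out = []
    for i in range(n):
        row = [0.0] * dim
        for j in range(n):
            w = d_inv_sqrt[i] * A_hat[i][j] * d_inv_sqrt[j]
            for c in range(dim):
                row[c] += w * X[j][c]
        out.append(row)
    return out

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Row order: DeleteAccount, GetUserToken, GetWeather (toy one-hot embeddings).
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Assumed orientation: A[i][j] = 1 lets tool i absorb tool j's embedding.
# GetUserToken (row 1) absorbs from DeleteAccount (column 0), which depends on it.
A = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
query = [1.0, 0.0, 0.0]  # stands in for "delete my account"

X_new = graph_convolve(X, A)
before = cosine(query, X[1])
after = cosine(query, X_new[1])
print(f"GetUserToken similarity: {before:.3f} -> {after:.3f}")  # 0.000 -> 0.816
```

In this toy setup the prerequisite moves from orthogonal to the query to strongly aligned with it, while the unrelated tool is untouched. Real embeddings will shift far less dramatically, and the paper's exact edge convention may differ from the one assumed here.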

Operationally, this means a tool’s representation is blended with representations from its dependency neighbourhood. If DeleteAccount is semantically close to “delete my account,” then its prerequisite GetUserToken can inherit useful retrieval signal through the graph. The prerequisite becomes more retrievable not because its own description suddenly mentions deletion, but because the graph has told the embedding space that it belongs in that execution chain.

This is the paper’s most important conceptual move. It does not try to make every prerequisite tool semantically self-explanatory. It changes the representation so that execution relationships affect retrieval.

That is also why TGR is compatible with existing embedding retrievers. The paper tests TGR as an enhancement to Paraphrase MiniLM-L3-v2 and ToolBench-IR rather than as a replacement for retrieval models. In business terms: this is a wrapper architecture around retrieval, not a demand to throw away the whole search stack. Very considerate, for once.

The main evidence: better retrieval completeness, not just prettier rankings

The paper evaluates TGR on API-Bank and ToolBench-I1. API-Bank has 311 test samples across three levels. ToolBench is limited to I1 because I2 and I3 involve APIs across different categories, making graph construction more complex and costly.

The metrics are Recall, NDCG, and Pass Rate at top-5 and top-10 retrieval.

The most operator-relevant metric is Pass Rate. The paper defines Pass Rate@k as the proportion of test samples where all required tools are included in the top-$k$ retrieved set. That is stricter than retrieving one correct tool. It asks whether the agent received the full tool set needed to execute the task.
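The difference between Pass Rate and Recall is easiest to see side by side. The following is a minimal sketch of both metrics as defined above; the sample data is invented, and tool names such as `CloseSession`, `SendEmail`, and `GetProfile` are placeholders.

```python
def pass_rate_at_k(samples, k):
    """Share of samples whose required tools ALL appear in the top-k list."""
    hits = sum(1 for required, ranked in samples if set(required) <= set(ranked[:k]))
    return hits / len(samples)

def recall_at_k(samples, k):
    """Average fraction of required tools found in the top-k list."""
    per_sample = [len(set(req) & set(ranked[:k])) / len(req) for req, ranked in samples]
    return sum(per_sample) / len(per_sample)

# Each sample: (required tools, retrieved tools in rank order). Invented data.
samples = [
    (["GetUserToken", "DeleteAccount"], ["DeleteAccount", "CloseSession", "GetUserToken"]),
    (["Login", "UpdateEmail"], ["UpdateEmail", "SendEmail", "GetProfile"]),
]

print(pass_rate_at_k(samples, k=3))  # 0.5
print(recall_at_k(samples, k=3))     # 0.75
```

The second sample earns half credit under Recall for finding UpdateEmail, but zero under Pass Rate because Login is missing. That is the behaviour an operator cares about: half a tool chain executes nothing.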

A condensed view of the strongest comparisons:

| Dataset | Base retriever | Pass Rate@10 without TGR | Pass Rate@10 with TGR | Interpretation |
| --- | --- | --- | --- | --- |
| API-Bank | Paraphrase MiniLM-L3-v2 | 0.592 | 0.698 | TGR improves completion-oriented retrieval for a general embedding model |
| API-Bank | ToolBench-IR | 0.624 | 0.788 | TGR gives the strongest API-Bank result in the reported table |
| ToolBench-I1 | Paraphrase MiniLM-L3-v2 | 0.250 | 0.450 | Large relative gain from a weaker starting point |
| ToolBench-I1 | ToolBench-IR | 0.690 | 0.730 | Smaller but still positive gain over a specialised retriever |
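The Pass Rate@10 figures in the table can be restated as absolute and relative gains, which helps separate "large jump from a low base" from "small jump on a strong base". The numbers are the ones reported above; the `gain` helper is just arithmetic.

```python
def gain(base, with_tgr):
    """Return (absolute gain, relative gain) for one table row."""
    absolute = with_tgr - base
    return absolute, absolute / base

rows = [
    ("API-Bank", "Paraphrase MiniLM-L3-v2", 0.592, 0.698),
    ("API-Bank", "ToolBench-IR", 0.624, 0.788),
    ("ToolBench-I1", "Paraphrase MiniLM-L3-v2", 0.250, 0.450),
    ("ToolBench-I1", "ToolBench-IR", 0.690, 0.730),
]

for dataset, retriever, base, with_tgr in rows:
    absolute, relative = gain(base, with_tgr)
    print(f"{dataset:13s} {retriever:24s} +{absolute:.3f} abs, +{relative:.1%} rel")
```

The absolute gains range from 0.040 to 0.200, and the relative gains from about 6% to 80%. The largest relative lift comes from the weakest baseline, so it should not be read as the typical effect.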

Recall and NDCG also improve across the reported TGR pairings. For example, on API-Bank, ToolBench-IR improves from Recall@10 of 0.790 to 0.878, and from NDCG@10 of 0.670 to 0.712. On ToolBench-I1, ToolBench-IR improves from Recall@10 of 0.841 to 0.868, and from NDCG@10 of 0.807 to 0.829.

The practical interpretation is not “graphs beat embeddings.” That would be too neat, and therefore suspicious. The better reading is: dependency graphs can add complementary information to embeddings. ToolBench-IR remains strong, especially on ToolBench-I1, because it is already finetuned for tool retrieval and fits ToolBench’s structured tool-document format. TGR improves it anyway, which is the interesting part.

A finetuned retriever learns semantic and task-specific retrieval patterns. TGR adds structural information about what tools require each other. These are not the same signal.

The ablations are about graph quality and retrieval mechanics

The paper includes several analyses beyond the main table. They should not all be treated as equal evidence. Some are main evidence. Some are ablations. Some are robustness checks. Some are illustrative case studies.

| Test or analysis | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| API-Bank and ToolBench-I1 main results | Main evidence | TGR improves Recall, NDCG, and Pass Rate over tested baselines | General production reliability across all enterprise API systems |
| Discriminator validation/test performance | Implementation quality check | The dependency classifier is usable enough to build graphs for experiments | That automated dependency discovery is solved |
| Manual graph vs discriminator graph | Ablation on graph quality | Better dependency graphs improve TGR performance | That manual annotation is scalable |
| Graph density analysis | Sensitivity/robustness test | TGR gains rise as dependency graphs become denser | That every industry with complex workflows will see the same slope |
| Case studies | Mechanism illustration | Shows how missed prerequisites are recovered | Statistical evidence by itself |
| Similarity method appendix | Implementation detail/robustness check | Cosine similarity works best among tested scoring methods | That similarity scoring is the core contribution |

The manual graph comparison is especially revealing. On API-Bank, TGR using manually annotated graphs outperforms TGR using discriminator-built graphs. With ToolBench-IR, the discriminator graph gives Pass Rate@10 of 0.788; the manual graph gives 0.817. With Paraphrase MiniLM-L3-v2, the discriminator graph gives 0.698; the manual graph gives 0.711.

This is not a defect in the paper. It is a useful warning label. The method is only as good as the dependency graph it constructs. Better dependency identification leads to better retrieval. Excellent. Also inconvenient.

For enterprise deployment, that means dependency metadata should not be treated as a one-off data-preparation task. It should be part of API governance. Versioning, schema changes, authentication flows, permissions, and deprecations can all change dependency structure. A graph that is correct in March may become charmingly obsolete by June.

Dependency density tells us where the method is likely to matter most

The density analysis asks a sensible question: does TGR help more when the tool graph has more dependencies?

The authors group ToolBench APIs by category, rank categories by graph density, and measure the Recall@5 increment of ToolBench-IR+TGR over ToolBench-IR. The result shows an upward trend: as graph density increases, the recall increment tends to increase.

This supports the paper’s mechanism. If dependencies are the useful missing signal, then dependency-aware retrieval should be more valuable where dependencies are frequent. That is what the analysis suggests.
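For teams wanting to run a similar check on their own catalogue, the first step is a density number per tool category. The sketch below uses the standard directed-graph definition (edges present divided by ordered pairs possible); this article does not state the paper's exact formula, so treat that definition as an assumption. The category names and edge sets are invented.

```python
def graph_density(num_tools, edges):
    """Directed-graph density: edges present / ordered pairs possible."""
    if num_tools < 2:
        return 0.0
    return len(edges) / (num_tools * (num_tools - 1))

# Hypothetical categories: (number of tools, set of dependency edges).
categories = {
    "standalone_utilities": (5, set()),
    "account_management": (4, {
        ("DeleteAccount", "GetUserToken"),
        ("UpdateEmail", "Login"),
        ("Login", "ValidateCredentials"),
    }),
}

ranked = sorted(categories, key=lambda name: graph_density(*categories[name]))
for name in ranked:
    print(f"{name:22s} density={graph_density(*categories[name]):.2f}")
```

Ranking categories this way gives a rough first guess at where dependency-aware retrieval is worth piloting, before any retrieval metric has been measured.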

The business interpretation is straightforward but should not be overstretched.

TGR is likely more valuable in domains where tools are:

  • granular rather than monolithic;
  • sequenced rather than independent;
  • permissioned rather than freely callable;
  • stateful rather than stateless;
  • coupled by tokens, IDs, validation, or intermediate outputs.

A catalogue of standalone utilities may not benefit much. A modular enterprise workflow environment probably benefits more. The paper itself notes that fully featured tools are less likely to depend on others, while fine-grained specialised tools usually have more intensive dependencies. That sentence deserves more attention than it will get, because it quietly tells you when not to bother.

If your system exposes large, self-contained tools, semantic retrieval may already be adequate. If your system exposes thousands of narrow APIs that must be chained correctly, dependency-aware retrieval becomes much more interesting.

The case studies show the failure mode in miniature

The API-Bank case study is almost too clean, but it clarifies the mechanism.

The query asks to delete an account. The ground truth tools are GetUserToken and DeleteAccount. The dependency is:

DeleteAccount → GetUserToken

ToolBench-IR retrieves DeleteAccount at rank 1 but misses GetUserToken in the top five. TGR-enhanced ToolBench-IR retrieves GetUserToken at rank 1 and DeleteAccount at rank 5.

There is a trade-off hiding here. TGR changes the ranking enough that the prerequisite rises sharply, while the semantically obvious action tool moves lower but remains in the top-five set. For Pass Rate, that is a win: both required tools are retrieved. For a planner with a very small context budget, rank shifts still matter. Dependency-aware retrieval is not simply “more tools good.” It is a ranking intervention under a fixed top-$k$ constraint.
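The trade-off can be expressed as a single number: the smallest $k$ at which every required tool is in the retrieved set. The sketch below uses the ranks reported in the case study for the two required tools; the `other_*` entries are placeholders for whatever filled the remaining positions.

```python
def smallest_passing_k(required, ranked):
    """Smallest k such that every required tool is inside ranked[:k], else None."""
    positions = []
    for tool in required:
        if tool not in ranked:
            return None  # a required tool was never retrieved
        positions.append(ranked.index(tool) + 1)
    return max(positions)

required = ["GetUserToken", "DeleteAccount"]

# Required-tool ranks follow the case study; filler names are placeholders.
base_top5 = ["DeleteAccount", "other_1", "other_2", "other_3", "other_4"]
tgr_top5 = ["GetUserToken", "other_1", "other_2", "other_3", "DeleteAccount"]

print(smallest_passing_k(required, base_top5))  # None
print(smallest_passing_k(required, tgr_top5))   # 5
```

With TGR the sample passes at $k = 5$, but a planner limited to three tools would still miss DeleteAccount. The ranking shift helps only if the context budget covers the lowest-ranked required tool.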

The ToolBench case study has the same pattern. The query asks which football league predictions are available today and mentions Premier League and La Liga. The ground truth includes Get Today’s Predictions and Get Next Predictions, with Get Next Predictions depending on Get Today’s Predictions. The base retriever gets Get Next Predictions but misses the prerequisite. TGR retrieves both in the top five.

These examples are not proof by themselves. They are microscope slides. They show the paper’s core pathology clearly: the semantically obvious tool is not enough.

What the paper directly shows

The paper directly shows four things.

First, adding dependency-aware graph encoding to tested retrievers improves retrieval metrics on API-Bank and ToolBench-I1. The improvements are consistent across Recall, NDCG, and Pass Rate in the reported experiments.

Second, TGR can improve both a general sentence embedding model and a tool-specialised retriever. That suggests dependency structure is not merely compensating for a weak baseline.

Third, graph quality matters. Manual dependency graphs outperform discriminator-built graphs in the API-Bank comparison, which means dependency identification accuracy is a real performance lever.

Fourth, dependency density appears to moderate the benefit. TGR’s recall improvement tends to rise with graph density in the ToolBench category analysis.

That is already useful. It is also narrower than “TGR solves tool retrieval.” The paper does not show live production performance. It does not show end-to-end business process completion under real API errors, rate limits, permissions, policy constraints, or changing schemas. It does not evaluate ToolBench I2 or I3, where cross-category tool dependencies become more complicated.

This is not a scandal. Academic papers are allowed to have scope. We can all survive.

What Cognaptus infers for business use

The business value is not “better search.” Better search is a phrase people use when they want funding but have not yet found the failure mode.

The value is cheaper execution reliability.

For AI agents in enterprise settings, missing a prerequisite tool can cause a task to fail before the model even gets a fair chance to reason. The LLM planner cannot call a tool it never sees. Enlarging the context window may reduce the issue, but stuffing hundreds or thousands of tools into context is costly, noisy, and not a serious orchestration strategy unless one enjoys watching latency and confusion become colleagues.

Dependency-aware retrieval offers a more disciplined alternative: retrieve the execution chain, not merely the keyword match.

That has implications for agent platform design:

| Technical contribution | Operational consequence | ROI relevance |
| --- | --- | --- |
| Dependency discriminator | Automates discovery of prerequisite relationships | Reduces manual mapping cost, but needs quality control |
| Tool dependency graph | Makes workflow structure available before planning | Improves chance of retrieving complete tool sets |
| Graph-enhanced embeddings | Pulls prerequisites into retrieval space | Reduces failures caused by semantically invisible setup tools |
| Pass Rate evaluation | Measures whether all required tools were retrieved | Better aligned with task completion than single-tool recall |
| Density analysis | Identifies where TGR is more useful | Helps prioritise high-dependency domains first |

The near-term practical target is not every API in the company. It is the workflow classes where tool dependencies are dense and failures are expensive: identity operations, regulated customer updates, financial transactions, order management, claims processing, internal IT automation, and compliance workflows.

Start where missing one prerequisite creates a real failure, not where it merely creates a less elegant demo.

What remains uncertain before production adoption

The paper’s limitations are not cosmetic. They shape deployment.

The first issue is graph construction cost. The authors state that graph construction has $O(N^2)$ time complexity. Pairwise dependency checking becomes expensive as the tool catalogue grows. Enterprises rarely have the luxury of small, stable, lovingly documented API sets. They have inherited systems, duplicate tools, contradictory names, and endpoints last touched by someone now managing a vineyard.
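The quadratic growth is worth seeing in numbers. The sketch below counts classifier calls under the assumption that each unordered pair of tools is checked once, which fits a three-class discriminator that reports either direction in a single call. The catalogue sizes are illustrative.

```python
from math import comb

def pair_checks(num_tools):
    """Number of unordered tool pairs, i.e. N * (N - 1) / 2."""
    return comb(num_tools, 2)

for n in (100, 1_000, 10_000):
    print(f"{n:>6,} tools -> {pair_checks(n):>12,} pair checks")
```

Going from 1,000 to 10,000 tools multiplies the workload by roughly a hundred, from 499,500 checks to 49,995,000. Any pre-filtering that rules out obviously unrelated pairs before the classifier runs would be an engineering addition, not something described here.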

The second issue is dependency drift. APIs change. Authentication flows change. New intermediate IDs appear. Permissions are split. Rate limits alter workflow design. A dependency graph must be maintained, not merely generated.

The third issue is classifier trust. The discriminator’s reported F1 scores are promising, but production environments may contain domain-specific tools with obscure side effects. False dependencies can pollute retrieval. Missed dependencies can preserve failure. Human review may still be necessary for high-risk workflows.

The fourth issue is benchmark scope. API-Bank and ToolBench-I1 are useful evaluation settings, but they do not fully represent production agents operating under permissions, data sensitivity, API failures, retries, asynchronous jobs, and compliance constraints. The paper leaves ToolBench I2 and I3 for future work because cross-category graph construction is harder. That is exactly where many enterprise workflows live.

Finally, TGR improves retrieval, not planning correctness. Giving an agent the right tools does not guarantee it will call them in the right order, with the right arguments, under the right policy. It simply removes one common excuse.

The practical playbook: where to try dependency-aware retrieval first

A sensible implementation path would not begin with a grand graph of the entire enterprise. That way lies meetings, taxonomy wars, and possibly mild spiritual damage.

A better path:

  1. Select one dependency-heavy workflow family. Choose a domain where task failures often involve missing setup, authentication, validation, lookup, or token tools.
  2. Build a small gold dependency graph. Manually annotate a representative subset. This gives a quality benchmark and prevents blind trust in automated dependency extraction.
  3. Train or adapt a dependency classifier. Use domain-specific API documentation, schemas, logs, and workflow traces if available.
  4. Compare semantic retrieval against dependency-aware retrieval. Use Pass Rate-style metrics: did the retrieved set include every required tool?
  5. Test under constrained top-$k$. The point is not to retrieve everything. The point is to retrieve enough.
  6. Monitor graph drift. Treat dependency metadata as part of API lifecycle management.
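Steps 2, 3, and 6 of the playbook reduce to set comparisons over dependency edges. The following sketch scores a predicted graph against a small gold graph and reports what changed between two graph versions. All edge sets are invented, and `VerifyMfa` and `GetWeather` are hypothetical tool names.

```python
def edge_quality(gold, predicted):
    """Edge-level precision and recall of a predicted graph against a gold graph."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

def graph_drift(old_edges, new_edges):
    """Edges that appeared or disappeared between two graph versions."""
    return {"added": new_edges - old_edges, "removed": old_edges - new_edges}

# Edges are (dependent, prerequisite) pairs.
gold = {
    ("DeleteAccount", "GetUserToken"),
    ("UpdateEmail", "Login"),
    ("Login", "ValidateCredentials"),
}
predicted = {
    ("DeleteAccount", "GetUserToken"),
    ("UpdateEmail", "Login"),
    ("UpdateEmail", "GetWeather"),  # a false dependency
}
june = (gold - {("Login", "ValidateCredentials")}) | {("Login", "VerifyMfa")}

precision, recall = edge_quality(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
print(graph_drift(gold, june))
```

A false edge costs precision and risks pulling irrelevant tools into context; a missed edge costs recall and preserves the original failure. Running the drift check on every API release keeps the graph from quietly ageing.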

The important operational question is not whether TGR is clever. It is. The question is whether your tool environment has enough hidden prerequisite structure to make cleverness worth the maintenance.

The bottom line: agents fail at the seams

TGR is a useful reminder that agent failures often happen between tools, not inside them.

A model can understand the user’s request. It can retrieve the final action tool. It can still fail because the retrieval layer did not provide the prerequisites that make the action executable. This is the missing-link problem: the tool chain breaks at the seam between semantic relevance and operational dependency.

The paper’s contribution is to make that seam explicit. It reframes tool retrieval from “find tools that sound like the query” to “find tools that can support the execution path.” That is a better mental model for enterprise agents, where workflows are rarely single-step and almost never as clean as the product brochure.

The most interesting result is not merely that TGR improves benchmark numbers. It is that the improvement comes from structural knowledge that embeddings do not naturally capture: authentication before update, token before deletion, today’s prediction before next prediction. Boring prerequisites. Essential prerequisites. The kind of thing systems remember and demos forget.

For businesses building agentic automation, the lesson is blunt: do not only index what tools say they do. Map what they need from each other. Otherwise, your agent may keep retrieving the right tool for the wrong moment, which is a very modern way to fail.

Cognaptus: Automate the Present, Incubate the Future.


  1. Linfeng Gao, Yaoxiang Wang, Minlong Peng, Jialong Tang, Yuzhe Shang, Mingming Sun, and Jinsong Su, “Tool Graph Retriever: Exploring Dependency Graph-based Tool Retrieval for Large Language Models,” arXiv:2508.05152, 2025, https://arxiv.org/abs/2508.05152