When AI Agents Read the Manual: Why τ-Knowledge Exposes the Limits of LLM Reasoning

A customer asks a banking agent to handle a routine request.

Freeze a card. Replace a lost wallet. Open a better savings account. Close an old credit card. Apply a referral bonus. Nothing here sounds like artificial general intelligence. It sounds like Tuesday morning in a customer support queue.

Then the agent has to read the internal policy, discover which tool exists, verify the customer’s account state, notice that one action blocks another, decide whether the user’s claim needs verification, and make the right database update.

This is where the cheerful “just connect the LLM to your knowledge base” story begins to sweat.

The paper behind τ-Knowledge builds a benchmark around exactly this kind of operational mess: conversational agents working over unstructured enterprise knowledge, where success depends not only on retrieving documents but on using them to perform correct, policy-compliant, state-changing actions.¹ The benchmark’s new domain, τ-Banking, contains 698 knowledge documents, 51 discoverable tools, and 97 customer-support tasks. The best non-gold configuration reaches only 25.52% pass^1. Even when the necessary documents are handed to the agent directly, the strongest result rises only to 39.69% pass^1.

That is the important part. The agent is not merely lost because it cannot find the manual. Sometimes it reads the manual and still does the wrong thing.

The benchmark is hard because knowledge controls action

Most enterprise AI discussions treat knowledge retrieval as a plumbing problem. Index the documents, embed the chunks, retrieve top-k passages, attach them to the prompt, and the model should behave. This framing is convenient. It is also too tidy.

τ-Knowledge makes the problem harder in a way that resembles actual business operations. The knowledge base is not just a source of facts. It contains product terms, procedures, internal policies, and documentation for tools. Some tools are not initially visible to the agent. The agent must discover them in the documentation before it can use them through a generic tool-calling interface.

That design matters because knowledge access changes the agent’s action space. A human employee who does not know the correct internal system cannot complete the operation. An AI agent has the same problem, except it is also perfectly capable of sounding confident while holding the wrong procedure in its tiny theatrical hands.

The task loop therefore looks like this:

1. The user states a goal, often incompletely.
2. The agent searches the knowledge base.
3. It reads policy and product documents.
4. It discovers which tools are available.
5. It calls tools to observe or modify the banking database.
6. It continues the conversation as the user reveals more information.
7. The final database state is checked against the expected target state.
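To make the loop concrete, here is a toy sketch in Python. It is not the benchmark's code: the `BankEnv` class, the `freeze_card` tool, the document text, and the card ID are all hypothetical stand-ins, and the "agent" is a hard-coded script rather than an LLM. What it does show is the shape of the test: the tool is only reachable if the agent finds its name in the documentation, and success is judged on the final database state.

```python
from dataclasses import dataclass, field


# Every name here is hypothetical and exists only for illustration.
@dataclass
class BankEnv:
    db: dict
    # Tool documentation lives in the knowledge base; the tool starts hidden.
    docs: dict = field(default_factory=lambda: {
        "card-freeze-policy": "To freeze a card, call tool freeze_card(card_id).",
    })

    def search(self, query: str) -> list[str]:
        """Return document texts that mention every word in the query."""
        words = query.lower().split()
        return [t for t in self.docs.values() if all(w in t.lower() for w in words)]

    def call_tool(self, name: str, **args) -> str:
        """Generic tool interface: the caller must already know the tool name."""
        if name == "freeze_card":
            self.db["cards"][args["card_id"]]["status"] = "frozen"
            return "ok"
        return f"unknown tool: {name}"


def run_episode(env: BankEnv, target_db: dict) -> bool:
    # Steps 1-4: the goal arrives, the agent searches, reads, and discovers a tool.
    hits = env.search("freeze card")
    discovered = "freeze_card" if any("freeze_card" in h for h in hits) else None
    # Step 5: call the discovered tool to modify the database.
    if discovered:
        env.call_tool(discovered, card_id="c1")
    # Step 7: the verdict comes from the final database state, not the dialogue.
    return env.db == target_db
```

If the policy document is missing from `docs`, the scripted agent never learns the tool name, the database stays unchanged, and the episode fails no matter how polite the conversation was.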

The benchmark is not asking whether a model can answer a question about banking policy. It asks whether the model can turn policy into action under partial observability.

That distinction is the whole article.

τ-Banking turns customer support into a state-changing reasoning test

The τ-Banking domain is built around realistic fintech support workflows: opening and closing accounts, handling card replacement, applying rewards, disputing transactions, redeeming referral promotions, and recommending products under constraints.

The paper reports the following basic scale:

Component	τ-Banking scale	Why it matters
Knowledge documents	698	Enough for realistic search noise, but small enough for controlled evaluation
Total tokens	194,562	Large enough to challenge context and retrieval strategies
Knowledge categories	21	Product and policy diversity creates cross-document dependencies
Discoverable tools	51	Agents must learn capabilities from documentation
Tasks	97	Enough to compare models and retrieval configurations
Average required documents per task	18.6	Tasks require multi-document reasoning, not one-shot lookup
Average required tool calls per task	9.52	Correct execution is multi-step
Maximum required tool calls	33	Some tasks require long-horizon orchestration

The benchmark construction is also worth noting. The authors do not simply ask an LLM to improvise a pile of fake banking articles and hope the result is coherent. They begin with a structured database of product and policy variables, convert that structure into unstructured documents, then create tasks and expected final database states around the underlying variables. Human reviewers audit the tasks, gold documents, and valid solution trajectories.

This structured-to-unstructured pipeline is not just a dataset-generation trick. It gives the benchmark a useful property: the unstructured documents resemble enterprise documentation, but the authors can still verify whether a task is well-posed. In business terms, τ-Banking approximates the annoying middle ground between a clean database and a real company wiki, where the information exists but refuses to behave like a spreadsheet.

The mechanism of failure has four moving parts

A simple benchmark summary would say: “Models scored poorly.” True, but not very useful.

The more useful question is: where does the failure enter?

In τ-Knowledge, an agent can fail through at least four mechanisms.

First, it can fail to retrieve the right documents. This is the familiar RAG problem: the correct policy or product page never appears in context.

Second, it can retrieve the documents but misread their implications. This is the knowledge-use problem. The relevant text is present, but the agent does not combine rules correctly.

Third, it can discover or call the wrong tools. In τ-Banking, tools are part of the operational policy layer. Knowing the answer is not enough; the agent must execute the right state-changing operation.

Fourth, it can execute correct-looking actions in the wrong order. This is the silent killer in business workflows. A tool call may be valid in isolation but invalid after a different action changes the state.

These mechanisms interact. A bad search can lead to wrong reasoning. Wrong reasoning can lead to the wrong tool. The wrong tool can change the database, after which the next step becomes impossible. The model then produces an apologetic explanation, which is the corporate version of smoke coming out of the machine.

The paper’s best contribution is that it makes these failures observable. The final database state provides a hard evaluation target. The agent either completed the operational task or it did not.

The headline scores are low, but the gold-document result is the real warning

The main results are easy to state and harder to digest.

Configuration	Best reported result	Interpretation
Best non-gold setup	GPT-5.2 high reasoning + terminal search: 25.52% pass^1	Search plus reasoning still solves only about one quarter of tasks
Best pass^4 non-gold setup	GPT-5.2 high reasoning + Qwen3 embedding: 13.40% pass^4	Repeated-trial reliability drops sharply
Best gold-document setup	Claude-4.5-Opus high reasoning: 39.69% pass^1	Removing retrieval bottlenecks does not solve the benchmark
Best gold-document pass^4	Claude-4.5-Opus high reasoning: 26.80% pass^4	Even with required documents supplied, reliability remains limited

The gold-document condition deserves more attention than the leaderboard. In this configuration, the agent receives the task-critical documents directly in context. This does not remove every difficulty, but it does remove a major retrieval bottleneck.

If the industry’s favorite explanation were sufficient — “the model just needs better retrieval” — gold documents should produce a much larger rescue. Instead, the strongest model remains below 40% pass^1.

That result shifts the diagnosis. Retrieval matters, but the deeper weakness is operational reasoning over policy, tools, and state.

The benchmark also reports pass^k, under which a task counts as solved only if the agent completes it in all k independent trials. This matters because production systems need more than occasional success. A support agent that completes a banking workflow correctly one out of four times is not “promising.” It is a compliance incident with a friendly tone.
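For readers who want to see how harsh this metric is, here is a minimal Python sketch. It assumes the combinatorial estimator from the earlier τ-bench work, where a task with c successes in n recorded trials contributes C(c, k) / C(n, k); this article's source does not spell out the formula, so treat the estimator as an assumption. The function names are illustrative.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Chance that k trials drawn without replacement from the recorded ones all succeed."""
    if not 0 < k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    # comb returns 0 when num_successes < k, so too few successes gives 0.0.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(successes_per_task: list[int], num_trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks."""
    per_task = [pass_hat_k(num_trials, c, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)
```

Under this estimator, a task solved in one of four trials scores 0.25 at k=1 and 0.0 at k=4, which is why pass^4 numbers sit so far below pass^1.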

Search method matters, but not in the way RAG dashboards usually imply

The paper compares dense retrieval, sparse BM25 retrieval, terminal-style search, and gold-document access.

Terminal-based search gives strong models a useful advantage. The terminal condition exposes the knowledge base as files and lets the agent use shell utilities such as grep, cat, and find. Averaged across models, terminal use outperforms dense and sparse retrieval with statistical significance.

But this benefit is uneven. Recent high-reasoning models benefit most. GPT-5.2 without reasoning effort and older GPT models do not gain the same advantage. This suggests that freeform search is not a magic interface. It helps only when the model can plan a search strategy, inspect intermediate results, and revise the search path intelligently.
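The mechanics of terminal-style search are simple, which is part of the point. The sketch below is a rough Python equivalent of a recursive, case-insensitive grep over a folder of documents. The `.md` file layout and the function name are assumptions for illustration; the paper's agents use real shell utilities, not this helper.

```python
from pathlib import Path


def grep_docs(root: str, pattern: str) -> list[tuple[str, int, str]]:
    """Case-insensitive substring search over *.md files under root."""
    needle = pattern.lower()
    hits = []
    for path in sorted(Path(root).rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if needle in line.lower():
                # Report file name, 1-based line number, and the matching line.
                hits.append((path.name, lineno, line.strip()))
    return hits
```

The tool returns exactly what was asked for and nothing more. Choosing the next pattern, deciding which file to open in full, and knowing when to stop are left to the model, which is where the stronger reasoning models pull ahead.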

The cost side is less glamorous. Dense retrieval averages around 9.9–10.1 searches per task, and BM25 averages 11.4. Terminal use involves more exploration: the paper reports an average of 28.8 shell calls per task across all shell commands. Median turn time rises by 6.6 seconds relative to dense retrieval configurations.

So terminal access can improve performance, but it can also create a more expensive, slower, and noisier support interaction. For an internal research agent, that may be acceptable. For a customer-facing banking agent, “please enjoy this extended archaeological dig through our documentation” is not a product feature.

This is one of the paper’s most business-relevant points: retrieval quality and solution efficiency are separate metrics. An agent can preserve some success rate by searching harder, but customers experience that as latency, backtracking, and conversational sludge.

The appendix tests robustness, not a second thesis

Several additional experiments are easy to misread. They are not separate grand claims about the future of agent architecture. Their purpose is narrower: to check whether the main results are artifacts of retrieval configuration or benchmark design.

Test or analysis	Likely purpose	What it supports	What it does not prove
Gold-document condition	Separate retrieval failure from knowledge-use failure	Agents still fail when the key documents are supplied	Retrieval is unimportant
Document recall	Measure whether gold documents appeared in context	Search success depends on both retriever and agent query behavior	Document presence guarantees correct action
No-knowledge setup	Confirm tasks require the knowledge base	Average pass^1 drops to 2% without KB access	All failures are caused by missing documents
Long-context full-KB setup	Test whether irrelevant documents create realistic noise	Full context peaks at only 12% pass^1 for tested models	Long-context models are useless generally
Reranker ablation	Check whether pointwise reranking improves retrieval setup	No significant pass^1 improvement in reported settings	Reranking can never help
Grep-tool ablation	Check whether simple keyword search improves retrieval setup	No significant pass^1 improvement for most retrievers	Keyword search is irrelevant in all enterprise systems
Retrieved-document count	Tune top-k retrieval against performance and latency	k=20 does not significantly beat k=10 in reported settings	More context is always harmful

The cleanest practical lesson is not “use terminal search” or “do not use rerankers.” The lesson is more awkward: once an agent is expected to reason, search, and execute, the performance bottleneck moves around. Sometimes the retriever is the bottleneck. Sometimes the model’s search behavior is the bottleneck. Sometimes the document is already present and the agent still cannot convert it into the right action.

This is exactly why a simple RAG evaluation dashboard can be misleading. A retrieval metric may improve while task success barely moves.
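A small Python sketch shows how the two numbers can diverge. The record layout (`gold`, `seen`, `final_db`, `target_db`) is a hypothetical logging format, not the benchmark's schema.

```python
def document_recall(gold_ids: set[str], seen_ids: set[str]) -> float:
    """Fraction of required documents that appeared in the agent's context."""
    if not gold_ids:
        return 1.0
    return len(gold_ids & seen_ids) / len(gold_ids)


def summarize(runs: list[dict]) -> dict:
    """Report retrieval recall and task success side by side."""
    n = len(runs)
    return {
        "mean_recall": sum(document_recall(r["gold"], r["seen"]) for r in runs) / n,
        "task_success": sum(r["final_db"] == r["target_db"] for r in runs) / n,
    }


runs = [
    # Every required document was retrieved, yet the database ended up wrong.
    {"gold": {"a", "b"}, "seen": {"a", "b", "c"},
     "final_db": {"limit": 1000}, "target_db": {"limit": 2000}},
    {"gold": {"a"}, "seen": {"a"},
     "final_db": {"status": "frozen"}, "target_db": {"status": "frozen"}},
]
report = summarize(runs)
```

Here `report["mean_recall"]` is 1.0 while `report["task_success"]` is 0.5. A dashboard that shows only the first number would call this a perfect system.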

The failures look painfully familiar to anyone who has managed operations

The qualitative analysis is where τ-Knowledge becomes more than a benchmark table.

The paper identifies four recurring error modes across failed trajectories. These errors are not exotic. They are the same mistakes a weak junior employee might make after skimming a policy portal.

Failure mode	What happens	Business translation
Product interdependency mistakes	The agent notices one incentive but misses a better product-policy combination	The system optimizes a surface feature instead of the customer’s actual objective
Wrong subtask ordering	The agent executes actions in user-stated order even when policies require a different sequence	Valid actions become invalid because the workflow was not topologically planned
Overtrusting user assertions	The agent accepts user claims that should be checked against system state	The agent treats conversation as evidence when the database should be authority
Search inefficiency and assumptions	The agent commits to an underspecified interpretation or searches without focus	Ambiguity becomes cost, latency, and wrong resolution

The subtask-ordering example is especially useful. In one task, the user wants to dispute a transaction and request a credit limit increase. Bank policy rejects credit limit increases when disputes are pending. The correct path is not simply to satisfy both requests. The agent must infer that one operation should precede the other.

This is not a retrieval problem. The policy can be available. The agent still needs to build a dependency graph over actions.

That is a very different mental model from “answer the user’s question.” Enterprise agents need workflow reasoning. They must understand not only what a rule says, but when applying it changes the future action space.
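The dependency-graph idea can be sketched with Python's standard `graphlib` module. The action names are hypothetical, and the direction of the constraint is an assumption: the sketch supposes that filing a dispute leaves it pending, so the credit limit request has to go first. The article's source only says that one operation must precede the other.

```python
from graphlib import TopologicalSorter

# Hypothetical policy map: each action lists the actions that must run before it.
# Assumption: a filed dispute stays pending and would block a later limit increase.
MUST_RUN_AFTER = {
    "file_dispute": {"request_credit_limit_increase"},
    "request_credit_limit_increase": set(),
}


def plan(requested: list[str]) -> list[str]:
    """Order the requested actions so that every prerequisite comes first."""
    wanted = set(requested)
    # Keep only prerequisites the user actually asked for.
    graph = {a: MUST_RUN_AFTER.get(a, set()) & wanted for a in requested}
    return list(TopologicalSorter(graph).static_order())
```

Calling `plan(["file_dispute", "request_credit_limit_increase"])` returns the limit request first, regardless of the order in which the user mentioned the two goals. The hard part in practice is not the sort; it is extracting a map like `MUST_RUN_AFTER` from prose policy documents.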

The uncomfortable lesson for enterprise RAG: retrieved text is not operational control

Many businesses are still evaluating AI assistants as if the central question were: “Did the system retrieve a relevant chunk?”

τ-Knowledge suggests a better question: “Did the system produce the right final state under policy constraints?”

Those are not equivalent.

A retrieved paragraph can support a correct explanation while the agent still fails to update the database. A tool call can be technically valid while violating a hidden ordering constraint. A user’s statement can sound plausible while contradicting the system state. A product recommendation can satisfy one advertised benefit while failing the true optimization target.

For business deployment, the evaluation stack should therefore separate at least six layers:

Evaluation layer	Example metric	Why it matters
Retrieval coverage	Did gold or relevant documents appear in context?	Diagnoses search failure
Policy interpretation	Did the agent apply the rule correctly?	Diagnoses reasoning failure
Tool discovery	Did the agent identify the needed operation?	Diagnoses capability-access failure
Action sequencing	Were state-changing steps ordered correctly?	Diagnoses workflow-planning failure
Final-state correctness	Did the database match the target state?	Measures operational success
Efficiency	Time, turns, tokens, tool calls	Measures deployment cost and user experience

Most RAG pilots overmeasure the first layer and undermeasure the rest. This is understandable. Retrieval metrics are easier to compute. They also make vendors look better at demos, which is one of their natural ecological functions.

But in customer service, finance, insurance, HR, procurement, compliance, and internal operations, the value is not “the model cited the right policy.” The value is that the correct operation happened, with minimal risk and reasonable latency.

What Cognaptus would infer for business adoption

The paper directly shows that frontier models perform poorly on a simulated banking benchmark requiring unstructured knowledge retrieval, tool discovery, policy reasoning, and state-changing execution. It also shows that gold-document access improves performance but does not solve the task, and that search interface choices affect both success and latency.

From that, Cognaptus would infer several practical deployment principles.

First, do not treat RAG accuracy as the final acceptance test. RAG accuracy is a diagnostic, not the business outcome. The production acceptance test should include final-state verification, policy compliance, tool-call correctness, and repeated-run reliability.

Second, workflow maps are not optional. If an agent can perform actions that alter state, the system needs explicit modeling of dependencies: which action must happen first, which action blocks another, which user claim must be verified, and which tool output is authoritative.
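One piece of such a map is easy to illustrate: checking a user claim against the system of record before acting on it. The sketch below is a toy Python example with a hypothetical schema (`customers`, `tier`, `fee_waived`) and a made-up fee-waiver rule; none of it comes from the paper.

```python
def claim_matches_record(db: dict, customer_id: str, field_name: str, claimed_value) -> bool:
    """Treat the database, not the conversation, as the authority."""
    record = db["customers"].get(customer_id, {})
    return field_name in record and record[field_name] == claimed_value


def handle_fee_waiver(db: dict, customer_id: str, claimed_tier: str) -> str:
    # Hypothetical rule: only premium customers qualify for a waiver.
    if not claim_matches_record(db, customer_id, "tier", claimed_tier):
        return "escalate: claim does not match account record"
    if claimed_tier != "premium":
        return "deny: tier not eligible"
    db["customers"][customer_id]["fee_waived"] = True
    return "fee waived"
```

A customer on the standard tier who insists they are premium gets an escalation, and the database is left untouched.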

Third, tool discovery should be governed rather than improvised. τ-Banking makes discoverable tools a benchmark feature. In production, that suggests a design question: should an agent freely discover tools from documentation, or should documentation be compiled into a controlled capability registry with versioning, permissions, and tests? The latter sounds less magical. It also sounds less likely to accidentally waive a fee because the user sounded persuasive.
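As a rough illustration of the second option, here is a minimal capability registry in Python. The class names, the role string, and the version field are hypothetical design choices, not anything the paper prescribes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Capability:
    name: str
    version: str
    required_role: str
    handler: Callable[..., str]


class Registry:
    """Only registered, permission-checked capabilities can be invoked."""

    def __init__(self) -> None:
        self._caps: dict[str, Capability] = {}

    def register(self, cap: Capability) -> None:
        self._caps[cap.name] = cap

    def call(self, name: str, role: str, **args) -> str:
        cap = self._caps.get(name)
        if cap is None:
            raise KeyError(f"capability not registered: {name}")
        if role != cap.required_role:
            raise PermissionError(f"role {role!r} may not call {name}")
        return cap.handler(**args)


registry = Registry()
registry.register(
    Capability("freeze_card", "1.0.0", "support_agent", lambda card_id: f"frozen {card_id}")
)
```

An agent working through this registry can still read the documentation, but a tool name it found in a stale wiki page will raise an error instead of changing state.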

Fourth, efficiency belongs in the business case. The paper shows that agents can compensate for weaker retrieval with more search and longer interactions. This matters because token cost is not the only cost. Customer time, support escalation, audit review, and operational delay all belong in the ROI model.

Fifth, the right benchmark should resemble the workflow being automated. A legal intake agent, insurance claims assistant, accounting support bot, or HR policy agent needs a task suite with realistic documents, tool calls, state transitions, and edge cases. Generic reasoning scores are useful background information. They are not a deployment certificate.

Where the result applies, and where it should not be overread

τ-Banking is simulated. The policies, products, users, and tools are synthetic or benchmark-constructed rather than copied from a real bank. That limits direct translation into production failure rates. A company should not read 25.52% pass^1 and conclude that its own agent will succeed exactly one quarter of the time.

The domain is also fintech customer support. The specific difficulty profile may differ in software support, healthcare administration, logistics, or internal IT. Some workflows have cleaner tool APIs. Some have better-maintained documentation. Some are worse. Many companies know which category they belong to and are quietly hoping no one asks.

The user simulator is LLM-based, although the authors audit sampled traces and report a low rate of task-critical user errors. This supports the validity of the benchmark but does not eliminate all simulator-related concerns.

Finally, the tested models and retrieval methods represent a point in time. Future models may improve. Better agent scaffolding, memory management, constrained planners, policy compilers, and human-in-the-loop controls may raise performance substantially.

None of these boundaries weaken the central lesson. They only keep it properly sized. τ-Knowledge is not a prophecy about every enterprise agent. It is a controlled demonstration that operational knowledge use is harder than document retrieval, and current systems still stumble when the manual becomes part of the workflow.

The real benchmark is whether the agent can work

The most useful sentence to take from τ-Knowledge is not “frontier models failed.” That is too easy, and slightly boring.

The better takeaway is this: enterprise agents fail through mechanisms that ordinary RAG evaluations do not see.

They fail when product rules interact. They fail when action order matters. They fail when user statements require verification. They fail when the correct tool is hidden in documentation. They fail when more context creates more noise. They fail when search becomes a substitute for understanding.

That is why τ-Knowledge is valuable. It shifts the evaluation target from “Can the model retrieve information?” to “Can the model use operational knowledge to change the world correctly?”

For businesses, this is the difference between a chatbot and an agent. A chatbot can explain a policy. An agent can execute it. The second task is where the money is, and conveniently, where the failures are harder to hide.

The manual was never the whole solution. It was only the beginning of the test.

Cognaptus: Automate the Present, Incubate the Future.

¹ Quan Shi, Alexandra Zytek, Pedram Razavi, Karthik Narasimhan, and Victor Barres, “τ-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge,” arXiv:2603.04370, 2026.
