A customer asks a banking agent to handle a routine request.
Freeze a card. Replace a lost wallet. Open a better savings account. Close an old credit card. Apply a referral bonus. Nothing here sounds like artificial general intelligence. It sounds like Tuesday morning in a customer support queue.
Then the agent has to read the internal policy, discover which tool exists, verify the customer’s account state, notice that one action blocks another, decide whether the user’s claim needs verification, and make the right database update.
This is where the cheerful “just connect the LLM to your knowledge base” story begins to sweat.
The paper behind τ-Knowledge builds a benchmark around exactly this kind of operational mess: conversational agents working over unstructured enterprise knowledge, where success depends not only on retrieving documents but on using them to perform correct, policy-compliant, state-changing actions.[1] The benchmark’s new domain, τ-Banking, contains 698 knowledge documents, 51 discoverable tools, and 97 customer-support tasks. The best non-gold configuration reaches only 25.52% pass^1. Even when the necessary documents are handed to the agent directly, the strongest result rises only to 39.69% pass^1.
That is the important part. The agent is not merely lost because it cannot find the manual. Sometimes it reads the manual and still does the wrong thing.
The benchmark is hard because knowledge controls action
Most enterprise AI discussions treat knowledge retrieval as a plumbing problem. Index the documents, embed the chunks, retrieve top-k passages, attach them to the prompt, and the model should behave. This framing is convenient. It is also too tidy.
τ-Knowledge makes the problem harder in a way that resembles actual business operations. The knowledge base is not just a source of facts. It contains product terms, procedures, internal policies, and documentation for tools. Some tools are not initially visible to the agent. The agent must discover them in the documentation before it can use them through a generic tool-calling interface.
That design matters because knowledge access changes the agent’s action space. A human employee who does not know the correct internal system cannot complete the operation. An AI agent has the same problem, except it is also perfectly capable of sounding confident while holding the wrong procedure in its tiny theatrical hands.
The task loop therefore looks like this:
- The user states a goal, often incompletely.
- The agent searches the knowledge base.
- It reads policy and product documents.
- It discovers which tools are available.
- It calls tools to observe or modify the banking database.
- It continues the conversation as the user reveals more information.
- The final database state is checked against the expected target state.
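The loop above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the benchmark's harness: `ToyBankEnv`, the `freeze_card` tool, the document text, and the rule that reading a document unlocks the tool it names are all hypothetical simplifications invented for this example.

```python
from dataclasses import dataclass, field

# Hypothetical knowledge base: one document happens to describe a tool.
KNOWLEDGE_BASE = {
    "card_freeze_policy.md": "To freeze a card, call tool `freeze_card` with the card id.",
    "savings_rates.md": "Savings products and their current rates.",
}


@dataclass
class ToyBankEnv:
    """Toy environment: a tiny database plus tools hidden until discovered."""
    db: dict = field(default_factory=lambda: {"card_123": {"status": "active"}})
    discovered: set = field(default_factory=set)

    def search(self, query: str) -> list:
        """Naive substring search over the toy knowledge base."""
        hits = [text for text in KNOWLEDGE_BASE.values()
                if query.lower() in text.lower()]
        # Assumption: retrieving a document that names a tool makes it usable.
        for text in hits:
            if "`freeze_card`" in text:
                self.discovered.add("freeze_card")
        return hits

    def call_tool(self, name: str, **args) -> str:
        """Generic tool-calling interface; undiscovered tools are rejected."""
        if name not in self.discovered:
            return f"error: unknown tool {name!r}"
        if name == "freeze_card":
            self.db[args["card_id"]]["status"] = "frozen"
            return "ok"
        return "error: not implemented"


def task_passed(env: ToyBankEnv, target_db: dict) -> bool:
    """Success is judged on the final database state, not on the conversation."""
    return env.db == target_db


env = ToyBankEnv()
before = env.call_tool("freeze_card", card_id="card_123")  # tool not yet discovered
env.search("freeze a card")                                 # reads the policy document
after = env.call_tool("freeze_card", card_id="card_123")    # now the call goes through
passed = task_passed(env, {"card_123": {"status": "frozen"}})
```

The point of the sketch is the last line: nothing the agent says counts toward `passed`, only what the database looks like when the conversation ends.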
The benchmark is not asking whether a model can answer a question about banking policy. It asks whether the model can turn policy into action under partial observability.
That distinction is the whole article.
τ-Banking turns customer support into a state-changing reasoning test
The τ-Banking domain is built around realistic fintech support workflows: opening and closing accounts, handling card replacement, applying rewards, disputing transactions, redeeming referral promotions, and recommending products under constraints.
The paper reports the following basic scale:
| Component | τ-Banking scale | Why it matters |
|---|---|---|
| Knowledge documents | 698 | Enough for realistic search noise, but small enough for controlled evaluation |
| Total tokens | 194,562 | Large enough to challenge context and retrieval strategies |
| Knowledge categories | 21 | Product and policy diversity creates cross-document dependencies |
| Discoverable tools | 51 | Agents must learn capabilities from documentation |
| Tasks | 97 | Enough to compare models and retrieval configurations |
| Average required documents per task | 18.6 | Tasks require multi-document reasoning, not one-shot lookup |
| Average required tool calls per task | 9.52 | Correct execution is multi-step |
| Maximum required tool calls | 33 | Some tasks require long-horizon orchestration |
The benchmark construction is also worth noting. The authors do not simply ask an LLM to improvise a pile of fake banking articles and hope the result is coherent. They begin with a structured database of product and policy variables, convert that structure into unstructured documents, then create tasks and expected final database states around the underlying variables. Human reviewers audit the tasks, gold documents, and valid solution trajectories.
This structured-to-unstructured pipeline is not just a dataset-generation trick. It gives the benchmark a useful property: the unstructured documents resemble enterprise documentation, but the authors can still verify whether a task is well-posed. In business terms, τ-Banking approximates the annoying middle ground between a clean database and a real company wiki, where the information exists but refuses to behave like a spreadsheet.
The mechanism of failure has four moving parts
A simple benchmark summary would say: “Models scored poorly.” True, but not very useful.
The more useful question is: where does the failure enter?
In τ-Knowledge, an agent can fail through at least four mechanisms.
First, it can fail to retrieve the right documents. This is the familiar RAG problem: the correct policy or product page never appears in context.
Second, it can retrieve the documents but misread their implications. This is the knowledge-use problem. The relevant text is present, but the agent does not combine rules correctly.
Third, it can discover or call the wrong tools. In τ-Banking, tools are part of the operational policy layer. Knowing the answer is not enough; the agent must execute the right state-changing operation.
Fourth, it can execute correct-looking actions in the wrong order. This is the silent killer in business workflows. A tool call may be valid in isolation but invalid after a different action changes the state.
These mechanisms interact. A bad search can lead to wrong reasoning. Wrong reasoning can lead to the wrong tool. The wrong tool can change the database, after which the next step becomes impossible. The model then produces an apologetic explanation, which is the corporate version of smoke coming out of the machine.
The paper’s best contribution is that it makes these failures observable. The final database state provides a hard evaluation target. The agent either completed the operational task or it did not.
The headline scores are low, but the gold-document result is the real warning
The main results are easy to state and harder to digest.
| Configuration | Best reported result | Interpretation |
|---|---|---|
| Best non-gold setup | GPT-5.2 high reasoning + terminal search: 25.52% pass^1 | Search plus reasoning still solves only about one quarter of tasks |
| Best pass^4 non-gold setup | GPT-5.2 high reasoning + Qwen3 embedding: 13.40% pass^4 | Repeated-trial reliability drops sharply |
| Best gold-document setup | Claude-4.5-Opus high reasoning: 39.69% pass^1 | Removing retrieval bottlenecks does not solve the benchmark |
| Best gold-document pass^4 | Claude-4.5-Opus high reasoning: 26.80% pass^4 | Even with required documents supplied, reliability remains limited |
The gold-document condition deserves more attention than the leaderboard. In this configuration, the agent receives the task-critical documents directly in context. This does not remove every difficulty, but it does remove a major retrieval bottleneck.
If the industry’s favorite explanation (“the model just needs better retrieval”) were sufficient, gold documents should produce a much larger rescue. Instead, the strongest model remains below 40% pass^1.
That result shifts the diagnosis. Retrieval matters, but the deeper weakness is operational reasoning over policy, tools, and state.
The benchmark also uses pass^k, where a task counts as solved only if all k independent trials succeed. This matters because production systems need more than occasional success. A support agent that completes a banking workflow correctly one out of four times is not “promising.” It is a compliance incident with a friendly tone.
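To make the metric concrete: pass^k is the probability that k independent trials of the same task all succeed. A minimal sketch follows, assuming τ-Knowledge uses the estimator from the original τ-bench work, where a task run n times with c successes contributes C(c, k) / C(n, k). The function names and trial counts below are invented for illustration and are not taken from the paper.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k for one task from n trials with c successes.

    C(c, k) / C(n, k) is the chance that k trials drawn without
    replacement from the n observed trials are all successes.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


def benchmark_pass_hat_k(successes_per_task, n: int, k: int) -> float:
    """Average the per-task estimate over all tasks."""
    return sum(pass_hat_k(n, c, k) for c in successes_per_task) / len(successes_per_task)


# Hypothetical results: four tasks, four trials each, successes per task.
successes = [4, 1, 0, 2]
p1 = benchmark_pass_hat_k(successes, n=4, k=1)  # average single-trial success
p4 = benchmark_pass_hat_k(successes, n=4, k=4)  # only the always-correct task counts
```

With these made-up counts, `p1` is 0.4375 while `p4` is 0.25: the task that succeeds once in four trials helps pass^1 and contributes nothing to pass^4.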
Search method matters, but not in the way RAG dashboards usually imply
The paper compares dense retrieval, sparse BM25 retrieval, terminal-style search, and gold-document access.
Terminal-based search gives strong models a useful advantage. The terminal condition exposes the knowledge base as files and lets the agent use shell utilities such as grep, cat, and find. Averaged across models, terminal use outperforms dense and sparse retrieval with statistical significance.
But this benefit is uneven. Recent high-reasoning models benefit most. GPT-5.2 without reasoning effort and older GPT models do not gain the same advantage. This suggests that freeform search is not a magic interface. It helps only when the model can plan a search strategy, inspect intermediate results, and revise the search path intelligently.
The cost side is less glamorous. Dense retrieval averages around 9.9–10.1 searches per task. BM25 averages 11.4. Terminal use involves more exploration: the paper reports an average of 28.8 shell commands per task. Median turn time rises by 6.6 seconds relative to dense retrieval configurations.
So terminal access can improve performance, but it can also create a more expensive, slower, and noisier support interaction. For an internal research agent, that may be acceptable. For a customer-facing banking agent, “please enjoy this extended archaeological dig through our documentation” is not a product feature.
This is one of the paper’s most business-relevant points: retrieval quality and solution efficiency are separate metrics. An agent can preserve some success rate by searching harder, but customers experience that as latency, backtracking, and conversational sludge.
The appendix tests robustness, not a second thesis
Several additional experiments are easy to misread. They are not separate grand claims about the future of agent architecture. Their purpose is narrower: to check whether the main results are artifacts of retrieval configuration or benchmark design.
| Test or analysis | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Gold-document condition | Separate retrieval failure from knowledge-use failure | Agents still fail when the key documents are supplied | Retrieval is unimportant |
| Document recall | Measure whether gold documents appeared in context | Search success depends on both retriever and agent query behavior | Document presence guarantees correct action |
| No-knowledge setup | Confirm tasks require the knowledge base | Average pass^1 drops to 2% without KB access | All failures are caused by missing documents |
| Long-context full-KB setup | Test whether irrelevant documents create realistic noise | Full context peaks at only 12% pass^1 for tested models | Long-context models are useless generally |
| Reranker ablation | Check whether pointwise reranking improves retrieval setup | No significant pass^1 improvement in reported settings | Reranking can never help |
| Grep-tool ablation | Check whether simple keyword search improves retrieval setup | No significant pass^1 improvement for most retrievers | Keyword search is irrelevant in all enterprise systems |
| Retrieved-document count | Tune top-k retrieval against performance and latency | k=20 does not significantly beat k=10 in reported settings | More context is always harmful |
The cleanest practical lesson is not “use terminal search” or “do not use rerankers.” The lesson is more awkward: once an agent is expected to reason, search, and execute, the performance bottleneck moves around. Sometimes the retriever is the bottleneck. Sometimes the model’s search behavior is the bottleneck. Sometimes the document is already present and the agent still cannot convert it into the right action.
This is exactly why a simple RAG evaluation dashboard can be misleading. A retrieval metric may improve while task success barely moves.
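A small sketch shows how the two numbers can come apart. The run records below are fabricated for illustration only (two tasks under two hypothetical retriever configurations), and `gold_recall` and `summarize` are names invented here, not functions from the paper.

```python
def gold_recall(retrieved: set, gold: set) -> float:
    """Fraction of a task's gold documents that appeared in context."""
    if not gold:
        return 1.0
    return len(retrieved & gold) / len(gold)


def summarize(runs: list) -> tuple:
    """Return (mean gold-document recall, final-state success rate)."""
    recall = sum(gold_recall(r["retrieved"], r["gold"]) for r in runs) / len(runs)
    success = sum(r["final_state_ok"] for r in runs) / len(runs)
    return recall, success


# Hypothetical configuration A: task 1 misses one gold document and fails.
config_a = [
    {"retrieved": {"d1"}, "gold": {"d1", "d2"}, "final_state_ok": False},
    {"retrieved": {"d3", "d4"}, "gold": {"d3", "d4"}, "final_state_ok": True},
]
# Hypothetical configuration B: task 1 now retrieves everything and still fails.
config_b = [
    {"retrieved": {"d1", "d2"}, "gold": {"d1", "d2"}, "final_state_ok": False},
    {"retrieved": {"d3", "d4"}, "gold": {"d3", "d4"}, "final_state_ok": True},
]

summary_a = summarize(config_a)  # recall 0.75, success 0.5
summary_b = summarize(config_b)  # recall 1.0, success 0.5
```

In this toy data the retrieval dashboard improves from 0.75 to 1.0 while the success rate stays at 0.5, which is the pattern the gold-document condition exposes at benchmark scale.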
The failures look painfully familiar to anyone who has managed operations
The qualitative analysis is where τ-Knowledge becomes more than a benchmark table.
The paper identifies four recurring error modes across failed trajectories. These errors are not exotic. They are the same mistakes a weak junior employee might make after skimming a policy portal.
| Failure mode | What happens | Business translation |
|---|---|---|
| Product interdependency mistakes | The agent notices one incentive but misses a better product-policy combination | The system optimizes a surface feature instead of the customer’s actual objective |
| Wrong subtask ordering | The agent executes actions in user-stated order even when policies require a different sequence | Valid actions become invalid because the workflow was not topologically planned |
| Overtrusting user assertions | The agent accepts user claims that should be checked against system state | The agent treats conversation as evidence when the database should be authority |
| Search inefficiency and assumptions | The agent commits to an underspecified interpretation or searches without focus | Ambiguity becomes cost, latency, and wrong resolution |
The subtask-ordering example is especially useful. In one task, the user wants to dispute a transaction and request a credit limit increase. Bank policy rejects credit limit increases when disputes are pending. The correct path is not simply to satisfy both requests. The agent must infer that one operation should precede the other.
This is not a retrieval problem. The policy can be available. The agent still needs to build a dependency graph over actions.
That is a very different mental model from “answer the user’s question.” Enterprise agents need workflow reasoning. They must understand not only what a rule says, but when applying it changes the future action space.
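The dependency-graph idea can be sketched with Python's standard `graphlib` module. The action names are hypothetical, and the sketch assumes the intended resolution of the example is to request the credit limit increase before the dispute is filed, since a pending dispute would block the increase.

```python
from graphlib import TopologicalSorter

# Hypothetical policy-derived constraints: each action maps to the set of
# actions that must be completed before it.
MUST_COME_FIRST = {
    "file_dispute": {"request_credit_limit_increase"},
    "request_credit_limit_increase": set(),
}


def plan(requested: list, constraints: dict) -> list:
    """Order the requested actions so every prerequisite runs first."""
    wanted = set(requested)
    # Keep only the constraints that involve actions the user actually asked for.
    graph = {action: constraints.get(action, set()) & wanted for action in requested}
    return list(TopologicalSorter(graph).static_order())


# The user mentions the dispute first, then the limit increase.
user_order = ["file_dispute", "request_credit_limit_increase"]
planned = plan(user_order, MUST_COME_FIRST)
```

Here `planned` puts `request_credit_limit_increase` ahead of `file_dispute`, reversing the order in which the user stated the requests. The hard part for an agent is not the sort itself but extracting the constraint table from prose policy in the first place.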
The uncomfortable lesson for enterprise RAG: retrieved text is not operational control
Many businesses are still evaluating AI assistants as if the central question were: “Did the system retrieve a relevant chunk?”
τ-Knowledge suggests a better question: “Did the system produce the right final state under policy constraints?”
Those are not equivalent.
A retrieved paragraph can support a correct explanation while the agent still fails to update the database. A tool call can be technically valid while violating a hidden ordering constraint. A user’s statement can sound plausible while contradicting the system state. A product recommendation can satisfy one advertised benefit while failing the true optimization target.
For business deployment, the evaluation stack should therefore separate at least six layers:
| Evaluation layer | Example metric | Why it matters |
|---|---|---|
| Retrieval coverage | Did gold or relevant documents appear in context? | Diagnoses search failure |
| Policy interpretation | Did the agent apply the rule correctly? | Diagnoses reasoning failure |
| Tool discovery | Did the agent identify the needed operation? | Diagnoses capability-access failure |
| Action sequencing | Were state-changing steps ordered correctly? | Diagnoses workflow-planning failure |
| Final-state correctness | Did the database match the target state? | Measures operational success |
| Efficiency | Time, turns, tokens, tool calls | Measures deployment cost and user experience |
Most RAG pilots overmeasure the first layer and undermeasure the rest. This is understandable. Retrieval metrics are easier to compute. They also make vendors look better at demos, which is one of their natural ecological functions.
But in customer service, finance, insurance, HR, procurement, compliance, and internal operations, the value is not “the model cited the right policy.” The value is that the correct operation happened, with minimal risk and reasonable latency.
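One way to keep the six layers from the table separate is to record them as distinct fields per episode rather than as a single score. The following is a minimal sketch; the class, field names, and example values are illustrative assumptions, not a schema from the paper.

```python
from dataclasses import dataclass
from typing import Optional

# Diagnostic layers in the order of the table, paired with illustrative field names.
DIAGNOSTIC_LAYERS = [
    ("retrieval_coverage", "gold_docs_in_context"),
    ("policy_interpretation", "policy_applied_correctly"),
    ("tool_discovery", "needed_tools_found"),
    ("action_sequencing", "actions_ordered_correctly"),
    ("final_state_correctness", "final_state_matches"),
]


@dataclass
class EpisodeEval:
    gold_docs_in_context: bool
    policy_applied_correctly: bool
    needed_tools_found: bool
    actions_ordered_correctly: bool
    final_state_matches: bool
    turns: int        # efficiency layer
    tool_calls: int   # efficiency layer

    def first_failure(self) -> Optional[str]:
        """Return the earliest failing layer, or None if all five checks passed."""
        for layer, attr in DIAGNOSTIC_LAYERS:
            if not getattr(self, attr):
                return layer
        return None


# Hypothetical episode: retrieval and tool discovery were fine, ordering was not.
episode = EpisodeEval(True, True, True, False, False, turns=14, tool_calls=11)
diagnosis = episode.first_failure()
```

For this episode `diagnosis` is `"action_sequencing"`, which tells an engineering team something a retrieval score alone never would.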
What Cognaptus would infer for business adoption
The paper directly shows that frontier models perform poorly on a simulated banking benchmark requiring unstructured knowledge retrieval, tool discovery, policy reasoning, and state-changing execution. It also shows that gold-document access improves performance but does not solve the task, and that search interface choices affect both success and latency.
From that, Cognaptus would infer several practical deployment principles.
First, do not treat RAG accuracy as the final acceptance test. RAG accuracy is a diagnostic, not the business outcome. The production acceptance test should include final-state verification, policy compliance, tool-call correctness, and repeated-run reliability.
Second, workflow maps are not optional. If an agent can perform actions that alter state, the system needs explicit modeling of dependencies: which action must happen first, which action blocks another, which user claim must be verified, and which tool output is authoritative.
Third, tool discovery should be governed rather than improvised. τ-Banking makes discoverable tools a benchmark feature. In production, that suggests a design question: should an agent freely discover tools from documentation, or should documentation be compiled into a controlled capability registry with versioning, permissions, and tests? The latter sounds less magical. It also sounds less likely to accidentally waive a fee because the user sounded persuasive.
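A controlled capability registry might look something like the sketch below. Everything in it is a hypothetical design for illustration: the `Capability` fields, the role names, and the `waive_fee` tool are assumptions, not part of τ-Banking or any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    name: str
    version: str
    required_role: str
    mutates_state: bool


class CapabilityRegistry:
    """Explicit allow-list of tools, replacing free discovery from documentation."""

    def __init__(self) -> None:
        self._caps = {}

    def register(self, cap: Capability) -> None:
        self._caps[cap.name] = cap

    def authorize(self, name: str, role: str) -> Capability:
        """Return the capability only if it is registered and the role may use it."""
        cap = self._caps.get(name)
        if cap is None:
            raise KeyError(f"unregistered tool: {name}")
        if cap.required_role != role:
            raise PermissionError(f"{role} may not call {name}")
        return cap


registry = CapabilityRegistry()
registry.register(Capability("freeze_card", "1.2.0", "support_agent", mutates_state=True))
registry.register(Capability("waive_fee", "0.9.1", "supervisor", mutates_state=True))

allowed = registry.authorize("freeze_card", role="support_agent")

try:
    registry.authorize("waive_fee", role="support_agent")
    fee_waiver_blocked = False
except PermissionError:
    fee_waiver_blocked = True  # persuasion in the chat does not change the role
```

The trade-off is that a registry has to be maintained alongside the documentation, whereas free discovery stays current automatically but inherits every ambiguity in the documents.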
Fourth, efficiency belongs in the business case. The paper shows that agents can compensate for weaker retrieval with more search and longer interactions. This matters because token cost is not the only cost. Customer time, support escalation, audit review, and operational delay all belong in the ROI model.
Fifth, the right benchmark should resemble the workflow being automated. A legal intake agent, insurance claims assistant, accounting support bot, or HR policy agent needs a task suite with realistic documents, tool calls, state transitions, and edge cases. Generic reasoning scores are useful background information. They are not a deployment certificate.
Where the result applies, and where it should not be overread
τ-Banking is simulated. The policies, products, users, and tools are synthetic or benchmark-constructed rather than copied from a real bank. That limits direct translation into production failure rates. A company should not read 25.52% pass^1 and conclude that its own agent will succeed exactly one quarter of the time.
The domain is also fintech customer support. The specific difficulty profile may differ in software support, healthcare administration, logistics, or internal IT. Some workflows have cleaner tool APIs. Some have better-maintained documentation. Some are worse. Many companies know which category they belong to and are quietly hoping no one asks.
The user simulator is LLM-based, although the authors audit sampled traces and report a low rate of task-critical user errors. This supports the validity of the benchmark but does not eliminate all simulator-related concerns.
Finally, the tested models and retrieval methods represent a point in time. Future models may improve. Better agent scaffolding, memory management, constrained planners, policy compilers, and human-in-the-loop controls may raise performance substantially.
None of these boundaries weaken the central lesson. They only keep it properly sized. τ-Knowledge is not a prophecy about every enterprise agent. It is a controlled demonstration that operational knowledge use is harder than document retrieval, and current systems still stumble when the manual becomes part of the workflow.
The real benchmark is whether the agent can work
The most useful sentence to take from τ-Knowledge is not “frontier models failed.” That is too easy, and slightly boring.
The better takeaway is this: enterprise agents fail through mechanisms that ordinary RAG evaluations do not see.
They fail when product rules interact. They fail when action order matters. They fail when user statements require verification. They fail when the correct tool is hidden in documentation. They fail when more context creates more noise. They fail when search becomes a substitute for understanding.
That is why τ-Knowledge is valuable. It shifts the evaluation target from “Can the model retrieve information?” to “Can the model use operational knowledge to change the world correctly?”
For businesses, this is the difference between a chatbot and an agent. A chatbot can explain a policy. An agent can execute it. The second task is where the money is, and conveniently, where the failures are harder to hide.
The manual was never the whole solution. It was only the beginning of the test.
Cognaptus: Automate the Present, Incubate the Future.
[1] Quan Shi, Alexandra Zytek, Pedram Razavi, Karthik Narasimhan, and Victor Barres, “τ-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge,” arXiv:2603.04370, 2026.