Agents with Interest: How Fintech Taught RAG to Read the Fine Print

Ask a product manager in a financial technology company a simple question — “How does this feature behave under that framework?” — and the answer may live in five places, three teams, two stale wikis, and one acronym that means different things depending on who had coffee with whom.

This is the everyday enemy of enterprise AI. Not lack of models. Not lack of dashboards. Not even lack of documents. The problem is that internal knowledge rarely behaves like a neat public benchmark. It is fragmented, duplicated, partially obsolete, acronym-heavy, and governed by access rules that make the usual “just send it to a cloud assistant” suggestion both naïve and professionally adventurous.

That is the setting for “Retrieval Augmented Generation (RAG) for Fintech: Agentic Design and Evaluation,” a Mastercard-linked study proposing an on-premise, agentic RAG architecture for fintech knowledge bases.¹ The paper does not claim to reinvent retrieval. Its useful contribution is more specific: it shows how a standard RAG pipeline starts to fail when enterprise documents are semantically sparse and organisationally messy, then tests whether specialised agents can repair some of those failure modes.

The answer is yes, but not in the way the “agents fix everything” crowd would enjoy printing on a conference tote bag. The gains are measurable, modest, query-dependent, and expensive in latency. Which makes the paper more useful, not less.

The failure starts before the model answers

A standard RAG system usually follows a simple route: reformulate the user query, retrieve similar chunks from a vector store, summarise the retrieved context, and return an answer. In clean settings, this is reasonable. In fintech, it is often a polite way to retrieve the wrong document with confidence.

The paper’s example is mundane enough to be convincing: internal Mastercard knowledge sources include note-taking apps, product platforms, architecture decks, and compliance PDFs. These documents may use acronyms inconsistently. A term such as “CMA” may refer to different internal concepts depending on context. A feature may be documented in one product wiki, referenced in another roadmap, and constrained by a third compliance note. The correct answer is not missing. It is scattered.

That distinction matters. Many enterprise AI projects treat retrieval failure as an indexing problem: chunk better, embed better, search harder. Sometimes that works. But in this paper, the central difficulty is not just finding a matching string. It is recovering intent when the query, document, and organisational context all under-specify the meaning.

Ordinary RAG fails here for four linked reasons:

Failure mode	What happens in a normal RAG pipeline	Why fintech makes it worse
Acronym ambiguity	The retriever matches shorthand without resolving meaning	Acronyms are dense, local, and often reused across teams
Fragmented evidence	One retrieval pass misses supporting context across documents	Product, compliance, and engineering knowledge sit in separate artefacts
Query underspecification	Broad questions retrieve broad but weakly relevant chunks	Users assume shared internal context that the system does not have
No second look	Retrieved chunks are summarised without deeper validation	Near-duplicate or adjacent pages can outrank the truly useful source

This is the reason a mechanism-first reading is better than a normal paper summary. The point is not “agentic RAG beats RAG.” The point is that each agent is a proposed repair for a specific retrieval pathology. The business question is whether those repairs are worth the overhead.

The agentic system turns retrieval from a line into a loop

The proposed system, A-RAG, adds an orchestrated set of specialised agents around the retrieval process. The baseline system, B-RAG, performs query reformulation, single-pass retrieval, and summarisation. A-RAG keeps those basic ingredients but adds more active control: intent classification, acronym resolution, sub-query generation, parallel retrieval, cross-encoder re-ranking, answer synthesis, and a QA agent that scores whether the answer is good enough to stop.

The mechanism is easy to understand if we avoid pretending “agent” is a magical noun. In this paper, an agent is mostly a specialised module with a task boundary. The orchestrator decides which modules to use and when to loop.

A user asks, for example, how CVaR is calculated in an IRRBB framework. The system first reformulates the query and expands known acronyms. It retrieves candidate chunks. It synthesises an answer. Then the QA agent scores the answer. If confidence is too low, the system generates narrower sub-queries such as “CVaR formula” or “IRRBB risk quantification,” retrieves again, re-ranks again, and attempts a better synthesis.

That loop is the important design move. Enterprise retrieval becomes less like asking a librarian for one book and more like asking an analyst to keep pulling threads until the answer is sufficiently supported. The elegance, such as it is, lies in giving the system permission to be dissatisfied with its first retrieval result. A startling innovation, apparently: asking again.
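
The loop is easier to judge once it is written down. What follows is a minimal sketch, not the paper's implementation: the function names, the threshold of 7.0, the cap of two rounds, and the toy corpus are all assumptions made for illustration, and re-ranking is left out to keep the control flow visible.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Answer:
    text: str
    score: float  # QA-agent score; a 1-10 scale is assumed here


def agentic_answer(
    query: str,
    retrieve: Callable[[str], List[str]],
    synthesise: Callable[[str, List[str]], str],
    qa_score: Callable[[str, str], float],
    make_subqueries: Callable[[str], List[str]],
    threshold: float = 7.0,  # hypothetical "good enough" cut-off
    max_rounds: int = 2,     # hypothetical cap on retrieval rounds (>= 1)
) -> Answer:
    """Retrieve, synthesise, score; escalate to sub-queries only if the score is low."""
    chunks = list(retrieve(query))  # first pass: what a single-pass pipeline would see
    best = Answer(text="", score=float("-inf"))
    for round_no in range(max_rounds):
        text = synthesise(query, chunks)
        score = qa_score(query, text)
        if score > best.score:
            best = Answer(text, score)
        if score >= threshold or round_no == max_rounds - 1:
            break
        # Low confidence: widen the evidence with narrower sub-queries.
        for sub in make_subqueries(query):
            for chunk in retrieve(sub):
                if chunk not in chunks:
                    chunks.append(chunk)
    return best


# Toy stand-ins (all invented) so the sketch runs end to end.
question = "How is CVaR calculated in an IRRBB framework?"
corpus = {
    question: ["IRRBB overview page"],
    "CVaR formula": ["CVaR is the expected loss beyond the VaR quantile"],
    "IRRBB risk quantification": ["IRRBB measures rate risk in the banking book"],
}
result = agentic_answer(
    question,
    retrieve=lambda q: corpus.get(q, []),
    synthesise=lambda q, chunks: " | ".join(chunks),
    qa_score=lambda q, text: 8.0 if "expected loss" in text else 4.0,
    make_subqueries=lambda q: ["CVaR formula", "IRRBB risk quantification"],
)
```

In this toy run the first pass scores 4.0, the sub-queries pull in the formula chunk, and the second pass scores 8.0 and stops. The point of the sketch is the shape: the extra retrieval work only happens when the scorer is unhappy.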

The strongest component is not the flashiest one

The paper’s discussion is careful about which pieces appear to matter most. The authors report that qualitative analysis found sub-query generation to be the most effective agentic component. That makes sense. If the underlying problem is fragmented evidence, decomposition is the natural repair. A vague or overloaded user question can be split into targeted searches, each more likely to retrieve a relevant slice of the internal corpus.

Acronym resolution is more mixed. The system uses local glossary logic and inline definition expansion to reduce ambiguity. In principle, this should help fintech retrieval substantially. In practice, the paper finds that acronym resolution can be error-prone when the acronym is undefined or insufficiently grounded in the retrieved document. In those cases, the agent may surface overly generic sources.
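
A glossary-and-regex resolver of this kind might look roughly like the sketch below. The glossary contents are assumptions: the CVaR and IRRBB expansions are the standard ones, while the two "CMA" meanings are invented purely to mimic an ambiguous internal term. The regex and the function name are likewise hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical glossary. The two "CMA" expansions are invented for illustration.
GLOSSARY: Dict[str, List[str]] = {
    "CVaR": ["Conditional Value at Risk"],
    "IRRBB": ["Interest Rate Risk in the Banking Book"],
    "CMA": ["Card Management Application", "Customer Migration Assessment"],
}

# Heuristic: a token that starts and ends with a capital letter (CVaR, IRRBB, CMA).
ACRONYM = re.compile(r"\b[A-Z][A-Za-z]*[A-Z]\b")


def expand_acronyms(
    query: str, glossary: Dict[str, List[str]] = GLOSSARY
) -> Tuple[str, List[str]]:
    """Expand acronyms with exactly one glossary meaning inline.

    Returns the rewritten query plus the acronyms left untouched because
    they are missing from the glossary or have several candidate meanings.
    """
    unresolved: List[str] = []

    def replace(match: "re.Match[str]") -> str:
        token = match.group(0)
        meanings = glossary.get(token, [])
        if len(meanings) == 1:
            return f"{token} ({meanings[0]})"
        unresolved.append(token)
        return token

    return ACRONYM.sub(replace, query), unresolved


expanded, open_terms = expand_acronyms("How is CVaR calculated in an IRRBB framework?")
ambiguous, flagged = expand_acronyms("What does CMA cover?")
```

The second call is the interesting one. The sketch declines to guess and returns "CMA" as unresolved, which is the safe behaviour but also the case the paper flags as error-prone: something downstream still has to pick a meaning from context.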

Cross-encoder re-ranking has a different role. It is not mainly about finding more documents; it is about ordering retrieved candidates by semantic fit. That helps when the first retrieval stage produces several near-matches. But it also contributes to latency. Re-ranking is the system’s editorial desk: useful, slower, and occasionally too pleased with itself.
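
The re-ranking step itself is structurally simple: score every (query, candidate) pair, sort, keep the top few. The sketch below shows that structure with a pluggable scorer. The token-overlap scorer is only a stand-in so the example runs without a model; it is not a cross-encoder, and a real deployment would pass in a function backed by one. All names here are hypothetical.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Score each (query, candidate) pair and return the top_k by score."""
    scored = [(cand, score_pair(query, cand)) for cand in candidates]
    # Python's sort is stable, so ties keep their first-stage retrieval order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: Jaccard overlap of lower-cased tokens (not a cross-encoder)."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    union = q | p
    return len(q & p) / len(union) if union else 0.0


first_stage = [
    "irrbb overview",
    "cvar formula and worked example",
    "cvar glossary entry",
]
ranked = rerank("cvar formula", first_stage, overlap_score, top_k=2)
top_passages = [passage for passage, _ in ranked]
```

Every candidate costs one scorer call, which is where the latency comes from: with a model-backed scorer, the cost grows with the number of candidates pulled in by the first stage and by each sub-query.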

The QA agent provides the stopping rule. Without it, agentic retrieval risks becoming a very expensive way to wander around a vector database. With it, the system can escalate only when the initial answer looks weak. The paper does not prove that this escalation policy is optimal, but it shows why adaptive depth is attractive in enterprise settings. Some questions need a quick lookup. Others need a search party.

The main evidence shows a real gain with a real cost

The main evaluation uses 85 validated question–answer–reference triples derived from an enterprise fintech knowledge base. The corpus itself is substantial: over 30,000 text chunks from 1,624 unique documents, with chunks mostly in the 50–120 word range. Both systems use the same LLM backend, Llama-3.1-8B-Instruct served through vLLM, with all-MiniLM-L6-v2 embeddings stored in ChromaDB.

The key results are clear:

Metric	Baseline RAG	Agentic RAG	Interpretation
Strict retrieval accuracy, Hit@5	54.12%	62.35%	A-RAG more often retrieves the exact ground-truth source in the top five
Adjusted retrieval accuracy	58.82%	69.41%	Manual review credits semantically valid alternate sources
Mean semantic score, 1–10	6.35	7.04	A-RAG answers are judged more semantically aligned with ground truth
Average latency per query	0.79s	5.02s	A-RAG is much slower because it performs more work

The strict retrieval result is the main evidence. A-RAG improves Hit@5 by 8.23 percentage points over B-RAG. That is not a revolution, but it is meaningful in a domain where retrieved context determines whether the final answer is grounded or merely fluent.
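
For readers who want the metric pinned down, Hit@5 can be computed along the following lines. This is a generic sketch of the metric, assuming one labelled ground-truth source per question; the function name and the four-question toy data are invented and are not the paper's evaluation set.

```python
from typing import Sequence


def hit_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int = 5) -> float:
    """Fraction of queries whose labelled source appears in the top-k retrieved sources."""
    if len(ranked) != len(gold):
        raise ValueError("need one ranked list per gold label")
    hits = sum(1 for sources, label in zip(ranked, gold) if label in sources[:k])
    return hits / len(gold)


# Toy data: source IDs are placeholders.
retrieved = [
    ["doc-a", "doc-b", "doc-c"],
    ["doc-x", "doc-y"],
    ["doc-m", "doc-n", "doc-o", "doc-p", "doc-q", "doc-gold"],  # gold sits at rank 6
    ["doc-z"],
]
labels = ["doc-b", "doc-x", "doc-gold", "doc-z"]

strict = hit_at_k(retrieved, labels, k=5)

# The reported gain, in percentage points, from the figures in the table above.
gain_pp = round(62.35 - 54.12, 2)
```

The third toy query shows why the metric is called strict: the right document was retrieved, but at rank six, so it counts as a miss.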

The adjusted retrieval result is more interesting but less formal. Manual inspection found that A-RAG retrieved valid answers from alternate, semantically equivalent sources in six additional cases, compared with three for B-RAG. Including those cases raises A-RAG to 69.41% and B-RAG to 58.82%. This supports the paper’s central claim that fragmented enterprise corpora can punish exact-link evaluation. The right answer may be present in a related document, not the labelled one.

But that adjustment also has a boundary: it is manually interpreted and not backed by a formal semantic-equivalence retrieval metric. It is useful evidence, not a new law of retrieval physics.

The semantic-answer evaluation adds another layer. An LLM judge scores answers from 1 to 10, where 9–10 means an exact match or perfect paraphrase, 6–8 means correct but missing minor detail, 3–5 means incomplete or an honest refusal, and 1–2 means incorrect or hallucinated. A-RAG scores 7.04 on average; B-RAG scores 6.35. The paper also reports that A-RAG reduces low-quality answers below score 5 from 18% to 8%, increases excellent answers at score 9 or above from 12% to 22%, and is preferred over B-RAG in 64% of cases.
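
The rubric and the distribution statistics quoted above can be expressed in a few lines. The band boundaries and the two thresholds (below 5, 9 or above) come from the description above; the function names and the ten sample scores are invented for illustration and are not the paper's data.

```python
from typing import Dict, Sequence


def rubric_band(score: float) -> str:
    """Map a 1-10 judge score to the rubric band described in the text."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    if score >= 9:
        return "exact match or perfect paraphrase"
    if score >= 6:
        return "correct but missing minor detail"
    if score >= 3:
        return "incomplete or honest refusal"
    return "incorrect or hallucinated"


def summarise(scores: Sequence[float]) -> Dict[str, float]:
    """Mean score plus the share of low (< 5) and excellent (>= 9) answers."""
    n = len(scores)
    return {
        "mean": sum(scores) / n,
        "low_share": sum(s < 5 for s in scores) / n,
        "excellent_share": sum(s >= 9 for s in scores) / n,
    }


sample_scores = [9, 7, 4, 8, 10, 6, 2, 7, 8, 9]  # invented scores
summary = summarise(sample_scores)
```

Reporting the two tail shares alongside the mean is the useful habit here. Two systems can post similar averages while differing sharply in how often they produce an answer nobody should act on.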

This is the part to read carefully. A-RAG is not suddenly producing perfect expert answers. It is shifting the distribution upward. Fewer bad answers, more strong answers, and some ties. In enterprise AI, distribution shifts are often more important than headline wins, because operational reliability depends on reducing failure frequency, not winning a demo.

Then comes the bill: latency rises from 0.79 seconds to 5.02 seconds per query. That is roughly a 6.4-fold increase. For casual internal chat, this may feel clunky. For compliance-sensitive research, procedural lookup, or product alignment work, five seconds may be cheap. The business decision is not “faster or better.” It is whether the query deserves deeper retrieval.

The human-curated benchmark is a stress test, not a victory lap

The paper also includes a human-curated benchmark. This is not the main evaluation set; it is better read as a robustness and diagnostic test. It contains 17 questions with 33 distinct ground-truth source links and is designed around a “one correct, many plausible” retrieval scenario. The categories are definitional, procedural, and acronym-based queries.

The results are more nuanced:

Category	B-RAG coverage	A-RAG coverage	B-RAG semantic accuracy	A-RAG semantic accuracy
Overall	66.67%	69.70%	7.88	8.06
Definitional	73.68%	68.42%	7.78	7.89
Procedural	57.14%	100.0%	7.75	8.25
Acronym	57.14%	42.85%	8.25	8.25

This table is where the lazy interpretation dies quietly.

A-RAG performs especially well on procedural questions: coverage rises from 57.14% to 100.0%, and semantic accuracy improves from 7.75 to 8.25. That is exactly where sub-query generation should help. Procedures often involve steps, dependencies, and references across pages. A single-pass retriever may grab one fragment; an iterative system can gather the sequence.

Definitional queries do not show the same pattern. A-RAG’s coverage is slightly lower than B-RAG’s, though semantic accuracy is marginally higher. This suggests that for straightforward definitions, the extra machinery may not retrieve more sources and may not be necessary. Sometimes a dictionary does not need a committee.

The acronym result is the sharpest warning. A-RAG’s acronym coverage falls from 57.14% to 42.85%, while semantic accuracy remains equal at 8.25. One plausible interpretation is that the system can still synthesise adequate answers from partial context, but its retrieval coverage for acronym-specific sources is weaker. The paper suggests that acronym resolution and re-ranking may over-filter or misprioritise near-duplicate sources. That matters because acronym handling is one of the system’s advertised repairs.

So the human-curated benchmark does not say “agentic is better.” It says agentic orchestration is especially useful when the answer requires procedural assembly across dispersed evidence. It is less obviously useful for simple definitions, and its acronym module needs more work.

The evaluation method is part of the contribution

The paper’s third contribution is easy to overlook: it proposes an enterprise-feasible evaluation workflow. That matters because regulated organisations cannot always use public datasets, crowd workers, or external evaluation platforms. Confidentiality and data residency are not footnotes; they are deployment constraints.

The authors generate evaluation pairs from internal chunks using model-assisted prompts, then filter them for specificity, faithfulness, and completeness. Manual review removes ambiguous cases. The result is an 85-item validated evaluation set. The human-curated benchmark adds a smaller but more realistic stress test with multiple plausible answer sources.

This is not a perfect evaluation design. It is small. It is tied to one proprietary knowledge base. It uses an LLM judge, which introduces its own measurement risks. But the design is operationally relevant because many firms need a way to test internal AI systems without exporting sensitive data or waiting months for subject-matter experts to annotate everything by hand.

For business leaders, that may be the most transferable lesson. The architecture may change. The models certainly will. But a secure, repeatable internal evaluation loop is a durable capability.

What this means for banks, insurers, and payment firms

The direct result of the paper is narrow: on one internal fintech knowledge base, using one LLM and embedding setup, a modular agentic RAG pipeline outperforms a simpler baseline on retrieval and semantic-answer measures, with much higher latency.

The business inference is broader but must stay disciplined. Agentic RAG is best viewed as a routing strategy for organisational knowledge problems. It is not a universal replacement for baseline RAG. It should be applied where the query has enough ambiguity, procedural complexity, or cross-document dependency to justify slower inference.

A practical deployment policy might look like this:

Query type	Preferred approach	Rationale
Simple lookup	Baseline RAG	Low latency matters; extra agents add little
Procedural question	Agentic RAG	Sub-query generation can recover steps across documents
Cross-product comparison	Agentic RAG	Evidence likely sits in multiple pages and taxonomies
Acronym-heavy query	Conditional agentic RAG	Useful only if glossary quality and context grounding are strong
Compliance-sensitive answer	Agentic RAG with audit trail	Slower response may be justified by better evidence aggregation
High-volume customer chat	Baseline or hybrid routing	Five-second latency may not be acceptable at scale

This is where many enterprise AI strategies go wrong. They choose one architecture and force every use case through it. The paper points to a better design principle: match retrieval depth to query risk.
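
The deployment table translates into a small routing function. The sketch below is one possible reading of it, not something the paper specifies: the query-type labels, the `glossary_trusted` flag, the default latency budget, and the rule that a tight budget overrides everything else are all assumptions.

```python
from enum import Enum

AGENTIC_MEAN_LATENCY_S = 5.02  # average A-RAG latency per query, as reported above


class Route(Enum):
    BASELINE = "baseline RAG"
    AGENTIC = "agentic RAG"
    AGENTIC_AUDITED = "agentic RAG with audit trail"


def choose_route(
    query_type: str,
    glossary_trusted: bool = False,   # hypothetical flag for glossary quality
    latency_budget_s: float = 10.0,   # hypothetical per-query budget
) -> Route:
    """Pick a pipeline from a pre-classified query type."""
    # Assumption: a budget below the observed agentic latency forces the fast path.
    if latency_budget_s < AGENTIC_MEAN_LATENCY_S:
        return Route.BASELINE
    if query_type in {"procedural", "cross-product comparison"}:
        return Route.AGENTIC
    if query_type == "compliance-sensitive":
        return Route.AGENTIC_AUDITED
    if query_type == "acronym-heavy" and glossary_trusted:
        return Route.AGENTIC
    # Simple lookups, high-volume chat, and anything unrecognised stay on the baseline.
    return Route.BASELINE


routes = {
    "procedural": choose_route("procedural"),
    "acronym, weak glossary": choose_route("acronym-heavy"),
    "acronym, strong glossary": choose_route("acronym-heavy", glossary_trusted=True),
    "procedural, 1s budget": choose_route("procedural", latency_budget_s=1.0),
    "compliance": choose_route("compliance-sensitive"),
}
```

The hard part is not this function but the classifier that feeds it. In the paper's design that job belongs to intent classification, and a misrouted query inherits either the baseline's blind spots or the agentic pipeline's latency.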

For a bank, the value is not that agents sound sophisticated. The value is fewer wrong internal answers about products, procedures, integrations, and controls. For an insurer, it may mean better navigation across policy wording, claims processes, and compliance memos. For a payment firm, it may mean faster reconciliation of feature behaviour across API documentation, product wikis, and risk frameworks.

The ROI case is therefore not “agentic RAG increases productivity.” That sentence should be taxed. The stronger case is that agentic retrieval reduces the human cost of assembling dispersed evidence in high-friction knowledge environments. It replaces some manual cross-checking, not all expert judgement.

The boundaries are narrow, and that is fine

The paper’s limitations matter because they define where the result can be trusted.

First, the main evaluation set has 85 questions. That is enough to show a signal, not enough to settle architecture choice for every fintech environment. The human-curated benchmark is smaller still, with 17 questions in the reported results. Its purpose is diagnostic, not definitive.

Second, the study uses one LLM backend and one embedding model. Different models, chunking strategies, re-rankers, glossaries, and vector databases could change the result. The paper tests an architecture under one implementation, not an invariant law of enterprise retrieval.

Third, the acronym resolver relies on heuristic and regex-based expansion. That is a fragile foundation for a domain where acronyms are precisely the problem. A stronger ontology, better context-aware disambiguation, or human-maintained glossary governance may be necessary before acronym-heavy use cases become reliable.

Fourth, adjusted retrieval accuracy is informative but informal. The idea is correct: exact-link matching undercounts valid answers in fragmented corpora. But the adjustment needs a formal metric if it is to support production evaluation.

Finally, the latency penalty is not a side issue. Going from 0.79 to 5.02 seconds changes where the system belongs. It may be perfectly acceptable for internal analyst tools and unacceptable for real-time customer-facing experiences. Architecture is not just accuracy; it is user tolerance, compute cost, and operational routing.

The real lesson: orchestration is a trade-off, not a trophy

This paper is valuable because it resists the clean fantasy of enterprise AI. Internal knowledge is not a library. It is a sedimentary formation of projects, teams, systems, and abbreviations. Standard RAG can retrieve from it, but retrieval alone does not understand why the right answer is split across adjacent artefacts.

Agentic RAG helps when it turns retrieval into a controlled investigation: clarify the query, resolve terms, search in pieces, re-rank evidence, score the answer, and loop only when necessary. That is the mechanism. The evidence suggests it improves retrieval and answer quality in a fintech corpus, especially for procedural questions. The trade-off is latency and added system complexity. The weak spot is acronym handling, which is amusing only if you are not the one deploying it.

The business lesson is not to buy more agents. It is to stop treating enterprise retrieval as a generic chatbot problem. Regulated firms need systems that know when a question is simple, when it is fragmented, and when the first answer is not good enough. That requires orchestration, evaluation, and domain governance.

In other words, fintech did not teach RAG to be brilliant. It taught RAG to be slightly less gullible. Given the state of enterprise documentation, that is already a respectable achievement.

Cognaptus: Automate the Present, Incubate the Future.

¹ Thomas Cook, Richard Osuagwu, Liman Tsatiashvili, Vrynsia Vrynsia, Koustav Ghosal, Maraim Masoud, and Riccardo Mattivi, “Retrieval Augmented Generation (RAG) for Fintech: Agentic Design and Evaluation,” arXiv:2510.25518, 2025. https://arxiv.org/abs/2510.25518
