Don't Trust. Verify: Fighting Financial Hallucinations with FRED

TL;DR for operators

A finance chatbot can retrieve the right document and still give the wrong answer. That is the uncomfortable bit. Retrieval gives the model evidence; it does not force the model to use that evidence correctly. FRED, short for Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models, tackles the layer after retrieval: checking whether the generated answer actually matches the supplied context, then marking or correcting the factual errors.¹

The useful idea is not “use a better model”. That is the reflexive answer, and in finance it is also an expensive way to remain only half-safe. FRED builds a specialist verification workflow. It defines finance-relevant error types, inserts controlled errors into financial question-answering examples, filters bad synthetic labels, fine-tunes small language models, and evaluates them on detection and editing.

The headline result is strongest for detection. On the paper’s FinQA+TATQA detection test, Phi-4 fine-tuned on 36K examples reaches 93.8 overall F1 and 97.5 binary F1, compared with o3 at 71.9 and 90.3. That is not a rounding error. It suggests that a compact, supervised verifier can outperform a much more general frontier model when the task is narrow, labelled, and domain-shaped.

Editing is less tidy. On FAVA, o3 leads the editing results across all three evaluators. On FinQA+TATQA, Phi-4-36K slightly beats o3 when scored by gpt-4-turbo, but o3 leads under o3-mini and Llama-Scout scoring. So the business lesson is not “replace your best model with a small one”. It is: use specialist models where the work is repetitive, taxonomic, and auditable; keep stronger general models where correction quality still benefits from broader reasoning.

For financial AI products, FRED points toward a practical architecture: generation first, verification second, release third. The verifier should not merely say “this might be wrong”. It should identify whether the problem is numerical, temporal, entity-related, relational, contradictory, or unverifiable. That converts hallucination from an executive anxiety into an operations queue. Much less poetic, much more useful.

A correct source does not guarantee a correct answer

The standard defence against hallucination is retrieval. Put the annual report, earnings transcript, or loan schedule in the context window; ask the model to answer from that context; hope the answer behaves. This is better than letting the model improvise from memory. It is not the same as factual control.

Finance is especially unforgiving because many errors are small enough to look boring and large enough to matter. A model can retrieve the correct table and still copy the wrong year. It can calculate a percentage using the wrong denominator. It can turn “decreased” into “increased”, confuse one subsidiary with another, or add a confident sentence that is not supported anywhere in the source. None of these requires cinematic AI failure. No robot uprising. Just one wrong fiscal year in a board pack and suddenly everyone becomes a philosopher of verification.

FRED starts from that post-retrieval failure mode. The question is not whether the model has access to evidence. The question is whether the generated passage is consistent with the evidence it was given. That distinction matters because it changes the system design. Retrieval is an input-control mechanism. FRED is an output-control mechanism.

The paper therefore sits in the growing line of work on context-grounded hallucination detection and editing. Its distinctive move is to make the task finance-specific. Instead of treating hallucination as one vague category, it builds a taxonomy for the kinds of mistakes that financial QA systems actually make, then trains models to detect and edit those mistakes directly.

FRED’s useful trick is to make hallucination operational

The paper’s mechanism is straightforward, which is part of its appeal. It does not ask the reader to worship a new architecture diagram with arrows arranged like modern art. It builds a pipeline:

1. Define a finance-specific taxonomy of factual errors.
2. Take grounded responses from financial QA datasets.
3. Insert controlled hallucinations into those responses.
4. Filter and correct malformed synthetic examples.
5. Fine-tune small language models to mark and repair errors.
6. Compare detection and editing against larger baseline models.

That sequence matters. Many hallucination discussions fail because they jump directly from “models hallucinate” to “we need trust”. FRED inserts the missing middle layer: a repeatable production task.

The models are trained not merely to label an answer as wrong, but to produce structured edits. Editable error types such as numerical, temporal, entity, and relation errors are marked with deletion and replacement spans. Unverifiable statements are tagged as unsupported. The desired output is not a moral judgement on the answer; it is a corrected passage with machine-readable evidence of what changed.
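A minimal sketch of how a downstream system might consume such output. The tag layout here is an assumption for illustration, loosely modelled on the deletion and replacement spans described above; the paper's exact format may differ, and `apply_edits` and `Edit` are hypothetical names. Dropping unverifiable spans is one possible policy, not the only one.

```python
import re
from dataclasses import dataclass

# Hypothetical tag layout (an assumption, not the paper's confirmed format):
#   <type><delete>wrong text</delete><mark>corrected text</mark></type>
#   <unverifiable>unsupported sentence</unverifiable>
EDIT_RE = re.compile(
    r"<(?P<type>numerical|temporal|entity|relation)>"
    r"<delete>(?P<old>.*?)</delete><mark>(?P<new>.*?)</mark>"
    r"</(?P=type)>",
    re.DOTALL,
)
UNVERIFIABLE_RE = re.compile(
    r"<unverifiable>(?P<text>.*?)</unverifiable>", re.DOTALL
)


@dataclass
class Edit:
    error_type: str
    old: str
    new: str  # empty string when the span is removed rather than replaced


def apply_edits(tagged: str) -> tuple[str, list[Edit]]:
    """Return the corrected passage plus a machine-readable list of changes.

    Edits are listed with replacement spans first, then unverifiable spans,
    not in strict document order.
    """
    edits: list[Edit] = []

    def replace_span(m: re.Match) -> str:
        edits.append(Edit(m["type"], m["old"], m["new"]))
        return m["new"]

    def drop_unverifiable(m: re.Match) -> str:
        edits.append(Edit("unverifiable", m["text"], ""))
        return ""

    text = EDIT_RE.sub(replace_span, tagged)
    text = UNVERIFIABLE_RE.sub(drop_unverifiable, text)
    # Collapse whitespace left behind by removed spans.
    return re.sub(r"\s{2,}", " ", text).strip(), edits


tagged = (
    "Revenue rose to <numerical><delete>12.4</delete><mark>14.2</mark>"
    "</numerical> million in <temporal><delete>2019</delete><mark>2018</mark>"
    "</temporal>. <unverifiable>Analysts expect further growth.</unverifiable>"
)
corrected, changes = apply_edits(tagged)
print(corrected)  # Revenue rose to 14.2 million in 2018.
```

The list of `Edit` records is what a redline interface or audit log would store; the corrected string is what a user would see.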

For a finance product, this is the important design principle. The output of verification should be useful to another system: a human reviewer, a compliance workflow, a UI warning, a redline interface, or an automated retry loop. “Low confidence” is not enough. Low confidence is a shrug wearing a metric.

The taxonomy turns finance errors into reviewable work units

FRED adapts a fine-grained hallucination taxonomy from prior work but modifies it for finance. The authors consolidate invented, subjective, and unverifiable categories into a single “unverifiable” class, then add two categories that matter heavily in financial analysis: temporal and numerical errors.

That gives six error types:

Error type	What it captures	Why finance teams should care
Numerical	Wrong quantities, percentages, ratios, totals, or calculations	The model can be directionally right and financially wrong
Temporal	Wrong dates, fiscal years, quarters, periods, or event ordering	Many finance questions are period-sensitive by design
Entity	Wrong company, product, geography, instrument, or named object	Entity swaps are easy to miss in dense corporate text
Relation	Wrong relationship, attribution, comparison, or causal direction	“Increased” versus “decreased” can reverse the business conclusion
Contradictory	Statement conflicts with the provided context or another response part	Useful for catching direct inconsistencies
Unverifiable	Statement cannot be grounded in the supplied context	The classic “sounds plausible, appears nowhere” problem

This taxonomy is not just academic labelling. It changes review economics. A financial analyst does not need every error treated the same way. A numerical error may need recalculation. A temporal error may need source-period inspection. An unverifiable claim may need removal unless an external source is introduced. A relation error may need a semantic check against the table or paragraph.

The paper’s business-relevant insight is that hallucination control should route work by error type. In production, the tag is not the end of the task. It is the beginning of the right next action.
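Routing by error type can be sketched as a simple lookup. The six type names follow the taxonomy above; the next actions for numerical, temporal, relation, and unverifiable errors paraphrase the preceding paragraphs, while the entity and contradictory actions, and the names `ErrorType`, `NEXT_ACTION`, and `route`, are illustrative assumptions.

```python
from enum import Enum


class ErrorType(Enum):
    NUMERICAL = "numerical"
    TEMPORAL = "temporal"
    ENTITY = "entity"
    RELATION = "relation"
    CONTRADICTORY = "contradictory"
    UNVERIFIABLE = "unverifiable"


NEXT_ACTION = {
    ErrorType.NUMERICAL: "recalculate from source figures",
    ErrorType.TEMPORAL: "inspect the source period",
    ErrorType.ENTITY: "confirm the named entity against the source",  # assumption
    ErrorType.RELATION: "semantic check against the table or paragraph",
    ErrorType.CONTRADICTORY: "compare the conflicting statements",  # assumption
    ErrorType.UNVERIFIABLE: "remove unless an external source is introduced",
}


def route(tags: list[str]) -> list[tuple[str, str]]:
    """Map verifier tags to (error type, next action), de-duplicated in first-seen order.

    Raises ValueError for a tag outside the taxonomy, so unknown labels
    fail loudly instead of being silently ignored.
    """
    seen: list[ErrorType] = []
    for tag in tags:
        error_type = ErrorType(tag)
        if error_type not in seen:
            seen.append(error_type)
    return [(t.value, NEXT_ACTION[t]) for t in seen]


for error_type, action in route(["temporal", "numerical", "temporal"]):
    print(f"{error_type}: {action}")
```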

Synthetic corruption is the factory, filtering is the quality gate

The data construction step is the core mechanism. The authors use FinQA and TAT-QA examples from RagBench. These are financial question-answering datasets involving textual and tabular evidence, exactly the sort of material where a model can misread a number, a period, or a relation while still sounding perfectly fluent.

The paper keeps examples where the original response is grounded in the provided document. Then it creates corrupted versions by inserting tagged errors. This produces aligned pairs: an erroneous passage and a corrected target output. The model can therefore be trained as a specialist detector-editor.

The 36K finance dataset contains 11K examples from FinQA and 25K from TAT-QA. It is not balanced: about 67.5% of examples are hallucinated and 32.5% are non-hallucinated. Among error types, temporal errors are the largest share at 30.8%, followed by numerical errors at 20.0%, contradictory statements at 18.6%, entity errors at 13.6%, unverifiable statements at 9.2%, and relation errors at 7.7%.

That distribution is useful for interpretation. The paper is not merely testing whether a model can detect any arbitrary hallucination. It is training and evaluating against a curated error ecology shaped around financial QA. That is also why the results should not be casually transferred to every enterprise domain. Legal contracts, clinical notes, insurance claims, and engineering logs have their own error ecologies. The method is more likely to transfer than the exact trained model.

The less glamorous but important part is filtering. Synthetic data is not automatically high quality because a large model produced it. The authors inspect common generation problems: incorrect labels, identical delete-and-mark spans, invalid tag formatting, and inconsistent content that cannot be reconstructed cleanly. They treat incorrect type and identical text as fixable, while invalid format and inconsistent content are discarded as unfixable.

This is a quiet but serious point. Synthetic data pipelines need their own QA process. Otherwise the verifier learns the mess created by the generator, and everyone congratulates themselves on automation while manufacturing labelled confusion at scale. A fine tradition, but not a good one.
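A rough sketch of what such a quality gate could look like for the four problem classes named above. It covers only delete-and-mark spans and assumes a hypothetical tag layout, so the function `triage` and both regexes are illustrative rather than the authors' implementation. One simplification to note: a label outside the taxonomy stands in for "incorrect type", because checking that a valid label matches the real error needs a model or a human.

```python
import re

ERROR_TYPES = {"numerical", "temporal", "entity", "relation"}

# Hypothetical layout: <type><delete>inserted error</delete><mark>original text</mark></type>
SPAN_RE = re.compile(
    r"<(?P<type>\w+)><delete>(?P<old>.*?)</delete>"
    r"<mark>(?P<new>.*?)</mark></(?P=type)>",
    re.DOTALL,
)
TAG_RE = re.compile(r"</?\w+>")


def triage(tagged: str, original: str) -> str:
    """Classify a synthetic example as 'ok', 'fixable', or 'discard'."""
    # Invalid format: stray tags remain once well-formed spans are removed.
    if TAG_RE.search(SPAN_RE.sub("", tagged)):
        return "discard"

    # Inconsistent content: keeping the <mark> text must give back
    # the original grounded response exactly.
    reconstructed = SPAN_RE.sub(lambda m: m["new"], tagged)
    if reconstructed != original:
        return "discard"

    for m in SPAN_RE.finditer(tagged):
        if m["type"] not in ERROR_TYPES:
            return "fixable"  # incorrect type label: relabel rather than drop
        if m["old"] == m["new"]:
            return "fixable"  # identical text: no error was actually inserted
    return "ok"


original = "Revenue was 14.2 million in 2018."
good = ("Revenue was <numerical><delete>12.4</delete><mark>14.2</mark>"
        "</numerical> million in 2018.")
print(triage(good, original))  # ok
```

The ordering is deliberate: unfixable problems are checked first so that an example is never "repaired" when it should have been thrown away.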

Detection is where the small specialist earns its keep

The detection results are the strongest part of the paper. The authors evaluate both general-domain FAVA and finance-specific FinQA+TATQA.

On FAVA, a general factuality dataset based on open-domain Wikipedia-style content, fine-tuned Phi-4 achieves the best overall detection performance: 79.8 overall F1 and 92.1 binary F1. o3 comes second with 69.8 overall F1 and 89.7 binary F1. This is useful comparison evidence, but not the main business reason to care. FAVA is not finance. It shows that the fine-tuned detector can work beyond the finance setup, but the real operational case is the finance benchmark.

On FinQA+TATQA, the gap becomes sharper:

Model	Overall F1	Binary F1	Notes
GPT-4.1 mini	46.0	77.8	Weak across most fine-grained categories
o3	71.9	90.3	Strong baseline, especially temporal and unverifiable
Phi-4-mini-36K	72.0	88.3	Comparable overall to o3, weaker binary
Phi-4-8K	89.9	96.7	Large gain from fine-tuning even at smaller data scale
Phi-4-36K	93.8	97.5	Best overall and binary detection
Qwen3-4B-36K	72.4	89.7	Competitive with o3 overall, slightly lower binary
Qwen3-14B-36K	77.6	91.1	Beats o3 on both overall and binary F1; well below the full-size Phi-4 models overall

The obvious reading is that fine-tuning works. The more useful reading is narrower: fine-tuning works extremely well when the target task has a stable structure, a clear taxonomy, and a dataset that teaches the model how errors look in that domain.

Phi-4-36K is not “smarter than o3” in general. That would be the wrong lesson, and also the kind of sentence that makes procurement departments dangerous. The paper shows that a fine-tuned specialist can beat a larger generalist on a constrained verification task. This is exactly the kind of task where smaller models can have operational leverage: repetitive, bounded, structured, and measurable.

The category-level scores also matter. Phi-4-36K reaches 86.0 F1 on numerical errors, 93.3 on temporal errors, 92.1 on entity errors, 88.1 on relation errors, 94.3 on contradictory statements, and 94.5 on unverifiable statements. Those numbers suggest breadth across the taxonomy, not merely success on one easy class.

The appendix precision and recall tables add diagnostic colour. For FinQA+TATQA, Phi-4-36K has very high precision across categories, including 94.2 for numerical, 95.1 for temporal, and 95.0 overall. Its recall is also strong, though numerical recall is lower at 79.0. In plain English: when it flags numerical errors, it is usually right, but it can still miss some. That is a useful profile for a reviewer-assist system. It is not sufficient on its own for a fully automated financial release gate, where missing a numerical error is expensive.
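As a quick consistency check, the standard F1 definition (the harmonic mean of precision P and recall R) can be applied to the numerical-error figures quoted above. This assumes the category F1 is derived from those same precision and recall values, which is not confirmed here.

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 94.2 \times 79.0}{94.2 + 79.0}
    = \frac{14883.6}{173.2}
    \approx 85.9
```

That lands close to the 86.0 numerical F1 reported for Phi-4-36K, and the small gap is consistent with rounding in the published figures. It also shows how the lower recall, not the precision, is what pulls the numerical score down.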

Editing is useful, but the frontier model still bites back

Detection asks: where is the problem? Editing asks: can the model fix it?

The paper evaluates editing with FActScoreLite, using several evaluator backends: gpt-4-turbo, o3-mini, and Llama-Scout. The authors compare corrected outputs against the source content, with “No Edit” representing the erroneous passage before correction.

Here the results are more mixed.

On FAVA, o3 leads clearly: 92.6 under gpt-4-turbo, 95.3 under o3-mini, and 89.9 under Llama-Scout. Fine-tuned Phi-4 follows at 85.1, 81.1, and 82.4. Phi-4-mini and GPT-4.1 mini improve substantially over No Edit, but they do not match o3.

On FinQA+TATQA, Phi-4-36K is more competitive. It scores 91.4 under gpt-4-turbo, slightly ahead of o3 at 91.0. But o3 leads under o3-mini, 94.5 versus 84.6, and under Llama-Scout, 90.4 versus 86.0. So the editing story is not a clean victory for the small specialist.

That split is important. Detection and editing are related, not identical. A model can learn to locate and classify errors with high precision, while still needing broader reasoning ability to produce the best correction. This is especially true when correction requires not only replacing a span, but choosing the correct value from a table, preserving the sentence, and maintaining output format.

For operators, this implies a layered design:

Task	Best-supported lesson from the paper	Practical design choice
Detect whether an answer contains grounded factual errors	Fine-tuned specialist models can outperform general frontier models on the paper’s finance detection task	Use a smaller verifier as a post-generation guardrail
Classify the type of error	The taxonomy provides reviewable categories	Route numerical, temporal, entity, relation, contradiction, and unverifiable cases differently
Produce corrected text	Results are useful but less decisive; o3 remains stronger under most editing evaluators	Use automated edits as suggestions, not silent replacements, especially in high-stakes outputs
Reduce review burden	Structured tags make review faster and more auditable	Show redlines and source context to analysts or compliance reviewers

This is where the paper is most commercially realistic, whether or not it says so directly. Detection is a natural automation target. Editing is a natural human-in-the-loop acceleration target. The former can block, route, or prioritise. The latter should probably propose.

The appendix is mostly diagnostics, not a hidden second thesis

The appendix material is worth reading because it clarifies what the experiments are doing.

The model-insertion investigation is an implementation detail and data-quality check. The authors test several models for synthetic error insertion and observe that GPT-3.5-turbo, GPT-4-turbo, and Gemma2-9B-IT produce the fewest unfixable errors among the sampled examples. This does not prove those models are generally best for synthetic data generation. The manual check is tiny: ten examples. Its purpose is more practical than definitive: choose acceptable generators for the pipeline.

The 8K versus 36K training comparison is closer to a scale sensitivity test. Phi-4 improves from 89.9 overall F1 at 8K to 93.8 at 36K on FinQA+TATQA detection. Phi-4-mini improves from 63.9 to 72.0. The result supports the unsurprising but useful idea that more domain-shaped synthetic data helps. It does not identify a saturation point. Nobody gets to say “36K is enough”. Nice try.

The precision and recall appendix tables are diagnostic evidence. They show where models over-detect, miss errors, or trade precision against recall. For production, these tables are more useful than the headline F1 score because business risk is asymmetric. In some workflows, false positives are tolerable because they merely annoy reviewers. In others, false negatives are dangerous because a bad answer reaches a client.

The training hyperparameters are implementation details: LoRA rank 16, two epochs, 8192-token context, 4-bit fine-tuning for supported models, and an A100 40GB setup through Google Colab. This helps readers understand feasibility. It does not by itself prove the setup is cheap at enterprise scale, because enterprise cost depends on data preparation, evaluation, governance, integration, monitoring, and the emotional tax of convincing legal that “synthetic hallucination insertion” is a real phrase.

What this means for a finance RAG stack

The business application is not to replace retrieval. It is to stop pretending retrieval is the entire safety story.

A finance RAG product using FRED-like ideas would look something like this:

User question
   ↓
Retrieve financial context
   ↓
Generate answer
   ↓
Run specialist verifier/editor
   ↓
Classify errors by type
   ↓
Route output:
   - clean answer
   - answer with warnings
   - redlined correction
   - human review
   - regenerate with stricter prompt

The verifier becomes an internal control layer. It can sit between the LLM and the user interface, between an analyst copilot and a draft report, or between an automated earnings-call QA system and a publishable answer.
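The routing step in the flow above could be sketched as a small decision function. The five route names come from the diagram; the rules that pick between them are a hypothetical policy, as are the names `Finding`, `HIGH_RISK`, and `choose_route`. A real deployment would set these thresholds with compliance input.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    error_type: str        # one of the six taxonomy labels
    has_replacement: bool  # True when the verifier proposed corrected text


# Assumption: numerical and temporal errors are treated as highest risk.
HIGH_RISK = {"numerical", "temporal"}


def choose_route(findings: list[Finding], retries_left: int = 1) -> str:
    """Pick one of the five output routes for a generated answer."""
    if not findings:
        return "clean answer"

    types = {f.error_type for f in findings}
    if types & HIGH_RISK:
        # Try once more with a stricter prompt, then escalate to a person.
        if retries_left > 0:
            return "regenerate with stricter prompt"
        return "human review"

    if all(f.has_replacement for f in findings):
        # Every flagged span has a proposed fix: show it as a suggestion.
        return "redlined correction"

    return "answer with warnings"


print(choose_route([]))                                   # clean answer
print(choose_route([Finding("entity", True)]))            # redlined correction
print(choose_route([Finding("numerical", True)], 0))      # human review
```

Note that even the "redlined correction" route shows the edit to a reader rather than applying it silently.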

The ROI is not just lower inference cost from using small models. That is the easy headline and often the least interesting one. The deeper value is operational:

Operational need	How a FRED-like layer helps
Analyst trust	Shows exactly which span is wrong and why
Compliance review	Produces structured evidence of checks performed
QA prioritisation	Routes high-risk numerical or temporal errors first
Product reliability	Blocks unsupported claims before they reach the user
Model monitoring	Tracks error types over time, not just generic failure rates
Continuous improvement	Builds a labelled error dataset from real review workflows

The last point is especially important. In a mature deployment, the synthetic dataset is only the beginning. Human-reviewed production errors should feed back into the verifier. Over time, the company builds a private taxonomy of its own failure modes: the metrics users ask about, the documents models misread, the table structures that cause problems, the recurring entity confusions, the prompts that invite unsupported commentary.

That private error dataset may become more valuable than another round of generic model switching. Swapping foundation models every quarter is not a strategy. It is software astrology with invoices.
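Tracking error types over time needs very little machinery to start. The sketch below assumes human-reviewed findings arrive as (period, error type) pairs; that record shape and the function name `error_shares_by_period` are hypothetical.

```python
from collections import Counter, defaultdict


def error_shares_by_period(reviews):
    """Return, for each period, the share of reviewed errors by type.

    `reviews` is an iterable of (period, error_type) pairs taken from
    human-reviewed production answers.
    """
    counts = defaultdict(Counter)
    for period, error_type in reviews:
        counts[period][error_type] += 1

    shares = {}
    for period, counter in counts.items():
        total = sum(counter.values())
        shares[period] = {t: n / total for t, n in counter.items()}
    return shares


reviews = [
    ("2025-Q1", "numerical"),
    ("2025-Q1", "temporal"),
    ("2025-Q1", "numerical"),
    ("2025-Q2", "entity"),
]
for period, shares in error_shares_by_period(reviews).items():
    print(period, {t: round(s, 2) for t, s in shares.items()})
```

A shift in these shares from one period to the next is the kind of signal a generic failure rate would hide.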

What the paper directly shows, and what Cognaptus infers

The paper directly shows that fine-tuned small language models can achieve strong detection performance on synthetic, context-grounded hallucination tasks built from FAVA and finance QA datasets. It also shows that a finance-specific taxonomy improves the fit between the detection task and the errors that matter in financial QA. On the authors’ FinQA+TATQA detection setup, Phi-4-36K outperforms o3 by a wide margin on overall F1 and binary F1.

The paper also directly shows that editing is harder to dominate. Phi-4-36K is competitive on FinQA+TATQA and slightly leads o3 under one evaluator, but o3 leads under the other two evaluators and dominates FAVA editing.

Cognaptus infers that a FRED-like design is most immediately valuable as a verification and review layer for finance RAG systems, not as a fully autonomous correction engine. That inference is grounded in the detection strength, the structured tagging design, and the mixed editing results. It is not a claim that the paper has demonstrated production readiness.

The uncertain part is real-world performance. The paper’s own limitations state that evaluation relies on language-model-generated synthetic data and that the mechanisms behind the observed gains are not yet fully understood. That matters. Real enterprise errors are messier than synthetic perturbations. They include OCR artefacts, malformed tables, stale filings, contradictory retrieved documents, ambiguous accounting language, and users asking questions that should not be answered from the provided context at all.

Boundaries before procurement gets excited

There are four boundaries to keep in view.

First, the data is synthetic. Synthetic data is useful because it gives control over error types and target labels. It is also dangerous because models can become excellent at detecting the kinds of errors the pipeline knows how to invent. Real hallucinations may not follow the same distribution.

Second, the task is benchmark-style financial QA. FinQA and TAT-QA are relevant, but they are not the same as live equity research, credit memo drafting, investment banking models, private company due diligence, or regulatory reporting. The method is portable. The measured performance is not automatically portable.

Third, editing is evaluated through model-based scoring. FActScoreLite with different evaluator backends gives useful comparative evidence, but model-evaluated correction quality is not a substitute for expert review in high-stakes finance. The fact that scores shift depending on the evaluator is itself a reminder that “corrected” is not always a single cleanly observed outcome.

Fourth, the mechanism is not fully explained. The authors explicitly note that future work should investigate why the framework outperforms baselines. This is not fatal for deployment, but it affects governance. In regulated domains, a strong metric is helpful; a strong metric plus a clear failure analysis is better.

None of these boundaries makes the paper weak. They make it usable. The point is not to pretend FRED solves financial hallucination. The point is to identify the exact layer where it can reduce risk today: structured, domain-specific post-generation verification.

Verification is a workflow, not a vibe

FRED’s most useful contribution is conceptual discipline. It takes “financial hallucination”, a phrase that is too broad to operate, and breaks it into a workflow: define error types, generate controlled failures, filter bad labels, fine-tune specialist detectors, evaluate detection and editing separately.

That distinction between detection and editing is the article’s centre of gravity. Detection looks ready for serious product experimentation. Editing looks promising but still needs careful supervision. A small model can be a very good financial proofreader without becoming the final author of truth.

For businesses building AI over financial documents, the practical lesson is blunt: do not trust the answer just because the source was retrieved. Verify the relationship between source and answer. Make the verification structured. Measure error categories. Keep humans where correction risk remains high. Automate the boring diagnostic layer before automating the judgement layer.

It is less glamorous than declaring the hallucination problem solved. Fortunately, glamour has never reconciled a cash-flow statement.

Cognaptus: Automate the Present, Incubate the Future.

¹ Likun Tan, Kuan-Wei Huang, and Kevin Wu, “FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models,” arXiv:2507.20930, 2025. https://arxiv.org/pdf/2507.20930