When ChatGPT makes up a statistic or misstates a date, it’s annoying. But when a financial assistant claims the wrong interest expense or misattributes a revenue source, it could move markets or mislead clients. This is the stark reality FRED confronts head-on.
FRED—short for Financial Retrieval-Enhanced Detection and Editing—is a framework fine-tuned to spot and fix factual errors in financial LLM outputs. Developed by researchers at Pegasi AI, it isn’t just another hallucination detection scheme. It’s an auditor with a domain-specific brain.
From Trustworthy Talk to Traceable Edits
The key innovation in FRED lies in its two-pronged ability:
- Detect hallucinations—fine-grained factual errors across six types: Numerical, Temporal, Entity, Relation, Contradictory, and Unverifiable.
- Edit them—replacing the falsehoods with grounded corrections using supporting documents.
This editing isn’t vague rewriting—it’s fully traceable:
```
The interest expense due <temporal><delete>August 2008</delete><mark>September 2018</mark></temporal>...
```
This structured format makes FRED’s output auditable, a crucial requirement in finance.
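To see what that auditability buys in practice, here is a minimal Python sketch of a downstream consumer. The tag vocabulary follows the example above, but the parser itself is illustrative, not part of FRED’s released tooling:

```python
import re
from dataclasses import dataclass

# The six hallucination types FRED distinguishes.
HALLUCINATION_TYPES = {
    "numerical", "temporal", "entity",
    "relation", "contradictory", "unverifiable",
}

@dataclass
class Edit:
    error_type: str   # e.g. "temporal"
    deleted: str      # the hallucinated span that was removed
    inserted: str     # the grounded replacement

TAG_PATTERN = re.compile(
    r"<(?P<type>\w+)><delete>(?P<del>.*?)</delete>"
    r"<mark>(?P<ins>.*?)</mark></(?P=type)>",
    re.DOTALL,
)

def extract_edits(tagged_text: str) -> list[Edit]:
    """Collect every tagged correction as a structured audit record."""
    return [
        Edit(m["type"], m["del"], m["ins"])
        for m in TAG_PATTERN.finditer(tagged_text)
        if m["type"].lower() in HALLUCINATION_TYPES
    ]

def apply_edits(tagged_text: str) -> str:
    """Render the clean, corrected text by keeping only the <mark> spans."""
    return TAG_PATTERN.sub(lambda m: m["ins"], tagged_text)

sample = ("The interest expense due <temporal><delete>August 2008</delete>"
          "<mark>September 2018</mark></temporal> is $19.5M.")
print(extract_edits(sample))
# [Edit(error_type='temporal', deleted='August 2008', inserted='September 2018')]
print(apply_edits(sample))
# The interest expense due September 2018 is $19.5M.
```

Because every correction carries both the deleted span and its replacement, a reviewer can reconstruct exactly what the model changed without diffing two free-text answers.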
Making Hallucinations on Purpose to Catch Them
FRED’s training data didn’t come from scraping real-world errors—it was synthesized with precision:
- Starting from correct financial QA data (FinQA and TAT-QA).
- Injecting controlled hallucinations via LLMs like GPT-4 and Gemma2.
- Labeling hallucination types explicitly, then creating a three-part example:
| Component | Example (Simplified) |
|---|---|
| Original | Interest expense from bonds due Sept 2018 is $19.5M |
| Erroneous | Interest expense from bonds due Aug 2008 is $19.5M |
| Corrected (Tagged) | … |
This pipeline produced 36,000 training samples, making the model both domain-savvy and editing-aware.
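As a rough sketch of the final assembly step, assuming the injecting LLM has already chosen the span to corrupt (the helper names and formats here are illustrative, not the authors’ code):

```python
from dataclasses import dataclass

@dataclass
class TrainingSample:
    original: str    # ground-truth answer from FinQA / TAT-QA
    erroneous: str   # answer with one injected hallucination
    corrected: str   # erroneous answer with the fix tagged in-place
    error_type: str  # one of the six hallucination categories

def make_sample(original: str, span: str, bad_span: str, error_type: str) -> TrainingSample:
    """Build an (original, erroneous, corrected) triple for one injected error.

    In the real pipeline an LLM (GPT-4 or Gemma2) chooses the span and the
    perturbation; here both are passed in explicitly to keep the sketch
    deterministic.
    """
    erroneous = original.replace(span, bad_span)
    corrected = original.replace(
        span,
        f"<{error_type}><delete>{bad_span}</delete><mark>{span}</mark></{error_type}>",
    )
    return TrainingSample(original, erroneous, corrected, error_type)

sample = make_sample(
    original="Interest expense from bonds due Sept 2018 is $19.5M",
    span="Sept 2018",
    bad_span="Aug 2008",
    error_type="temporal",
)
print(sample.erroneous)  # Interest expense from bonds due Aug 2008 is $19.5M
print(sample.corrected)  # ...due <temporal><delete>Aug 2008</delete><mark>Sept 2018</mark></temporal>...
```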
Smaller Models. Sharper Judgment.
The surprise? A Phi-4 model fine-tuned on the 36K synthetic samples outperforms OpenAI’s state-of-the-art o3 in financial hallucination detection:
| Model | Overall F1 (FinQA + TAT-QA) | Binary Accuracy |
|---|---|---|
| OpenAI o3 | 71.9 | 90.3 |
| Phi-4-36k | 93.8 | 97.5 |
Even the lightweight Phi-4-mini (4B params) reached 88.3 binary accuracy, making FRED feasible for on-prem or edge deployments in regulated environments.
On editing performance, measured with FactScoreLite, OpenAI still holds a slight edge in correcting errors, but the fine-tuned Phi-4 narrows the gap dramatically.
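If you want to sanity-check a detector on your own data, the two detection metrics are straightforward to compute. A minimal sketch, assuming per-example gold and predicted labels where `None` means “no hallucination” (this is my reading of the metrics, not the paper’s evaluation code):

```python
def binary_accuracy(gold: list, pred: list) -> float:
    """Did the model correctly say hallucination vs. no hallucination?"""
    hits = sum((g is None) == (p is None) for g, p in zip(gold, pred))
    return hits / len(gold)

def f1_per_type(gold: list, pred: list, error_type: str) -> float:
    """Standard F1 for one hallucination category."""
    tp = sum(g == p == error_type for g, p in zip(gold, pred))
    fp = sum(p == error_type and g != error_type for g, p in zip(gold, pred))
    fn = sum(g == error_type and p != error_type for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = ["temporal", None, "numerical", None]
pred = ["temporal", None, "entity", None]
print(binary_accuracy(gold, pred))          # 1.0 -- both hallucinations were flagged
print(f1_per_type(gold, pred, "temporal"))  # 1.0
```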
Why This Matters for Real-World Finance
The financial industry isn’t just seeking accurate summaries—it needs auditable, edit-traceable LLMs that:
- Flag unverified claims.
- Show where edits were made.
- Align responses with retrieved tables, statements, or filings.
FRED’s architecture supports this vision with interpretability baked in, and its use of small open models means it can be integrated into bespoke enterprise stacks—no API strings attached.
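Concretely, a FRED-style checker slots in as a post-generation pass in a RAG pipeline. A hedged sketch of that control flow, where `retriever`, `generator`, and `fred_detect_and_edit` are hypothetical stand-ins for an enterprise stack’s own components (reusing `extract_edits` and `apply_edits` from the first sketch):

```python
def answer_with_audit(question: str, retriever, generator, fred_detect_and_edit):
    """RAG pipeline with a FRED-style verification pass bolted on the end.

    The three callables are stand-ins for the enterprise stack's own
    components; only the control flow is the point here.
    """
    evidence = retriever(question)                  # tables, statements, filings
    draft = generator(question, evidence)           # raw LLM answer
    tagged = fred_detect_and_edit(draft, evidence)  # corrections tagged in-place
    return {
        "answer": apply_edits(tagged),          # clean text for the client
        "audit_trail": extract_edits(tagged),   # who-changed-what, for compliance
    }
```

The design choice worth noting: the verified answer and its audit trail come from the same tagged string, so the compliance record can never drift out of sync with what the client actually saw.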
The Road Ahead: From Error Correction to Model Governance
FRED’s most compelling contribution isn’t just technical. It suggests a future where LLMs in finance are governed like junior analysts:
- Their claims are checked.
- Their edits are traceable.
- Their hallucinations are not ignored, but diagnosed and repaired.
For developers building LLM-based research assistants, client chatbots, or audit tools, FRED offers a playbook—not just for reducing hallucinations, but for making models accountable.
Cognaptus: Automate the Present, Incubate the Future