Financial AI promises speed and scale — but in finance, a single misplaced digit can be the difference between compliance and catastrophe. The FAITH (Framework for Assessing Intrinsic Tabular Hallucinations) benchmark tackles this risk head‑on, probing how well large language models can faithfully extract and compute numbers from the dense, interconnected tables in 10‑K filings.

From Idea to Dataset: Masking With a Purpose

FAITH reframes hallucination detection as a context‑aware masked span prediction task. It takes real S&P 500 annual reports, hides specific numeric spans, and asks the model to recover them — but only after ensuring three non‑negotiable conditions:

| Criterion | Why It Matters in Finance |
| --- | --- |
| Uniqueness | Prevents multiple plausible answers, avoiding ambiguity in metrics. |
| Consistency | Ensures the masked number matches other parts of the report, avoiding internal contradictions. |
| Answerability | Guarantees the missing value can be derived from the given tables and text, ensuring fairness in evaluation. |

By restricting to numbers with units and verifying answerability via unanimous agreement from top LLMs, FAITH produces a large, clean dataset — scaling from a pilot set (9 companies, human‑annotated) to a main set (453 companies, LLM‑annotated).
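To make the construction concrete, here is a minimal Python sketch of the masking-plus-agreement idea. It is not FAITH's actual pipeline: the regex, the `[MASKED]` token, and the `ask_model(model_name, prompt)` helper are assumptions standing in for the paper's mask selection and LLM-based answerability check.

```python
import re

# Unit-bearing numeric spans only: dollar amounts (optionally with M/B/million/billion)
# or percentages. Plain unit-less integers are deliberately excluded.
UNIT_NUMBER = re.compile(r"\$\s?\d[\d,\.]*\s?(?:million|billion|M|B)?|\d[\d,\.]*\s?%")

def build_masked_example(passage: str, span_index: int = 0):
    """Mask the span_index-th unit-bearing number in the passage."""
    matches = list(UNIT_NUMBER.finditer(passage))
    if span_index >= len(matches):
        return None
    m = matches[span_index]
    answer = m.group(0)
    masked = passage[:m.start()] + "[MASKED]" + passage[m.end():]
    return {"context": masked, "answer": answer}

def is_answerable(example, judges, ask_model) -> bool:
    """Keep the example only if every judge model reproduces the hidden value.

    `ask_model(model_name, prompt)` is an assumed helper that returns the
    model's answer string; swap in your own API client.
    """
    prompt = (
        "Fill in the [MASKED] value using only the context below. "
        "Answer with the number and its unit.\n\n" + example["context"]
    )

    def normalize(s: str) -> str:
        return re.sub(r"[\s,]", "", s.lower())

    return all(
        normalize(ask_model(j, prompt)) == normalize(example["answer"])
        for j in judges
    )
```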

Four Flavors of Financial Reasoning

To see where models stumble, FAITH categorizes tasks into four mutually exclusive reasoning types:

  1. Direct Lookup – Find the exact number in a single cell.
  2. Comparative Calculation – Compare the same metric across periods (e.g., year‑over‑year change).
  3. Bivariate Calculation – Combine two metrics (e.g., gross margin).
  4. Multivariate Calculation – Multi‑step logic over three or more metrics.

This isn’t just academic bookkeeping — it creates a performance map that pinpoints when hallucinations creep in.
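To make the four categories concrete, here is a small illustrative computation for each, using made-up figures (all amounts hypothetical, in $ millions; nothing below comes from an actual filing):

```python
# 1. Direct Lookup: read a single cell.
revenue_2024 = 1_250.0  # taken verbatim from one table cell

# 2. Comparative Calculation: same metric across periods.
revenue_2023 = 1_100.0
yoy_change = (revenue_2024 - revenue_2023) / revenue_2023  # ≈ 13.6%

# 3. Bivariate Calculation: combine two different metrics.
cost_of_revenue_2024 = 800.0
gross_margin = (revenue_2024 - cost_of_revenue_2024) / revenue_2024  # 36.0%

# 4. Multivariate Calculation: multi-step logic over three or more metrics.
operating_expenses_2024 = 300.0
tax_rate = 0.21
net_income_est = (revenue_2024 - cost_of_revenue_2024
                  - operating_expenses_2024) * (1 - tax_rate)  # ≈ 118.5

print(f"YoY change: {yoy_change:.1%}, gross margin: {gross_margin:.1%}, "
      f"estimated net income: ${net_income_est:.1f}M")
```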

What the Numbers Say

Across dozens of LLMs, the leaderboard tells a clear story:

  • Top performers — Claude‑Sonnet‑4 (95.6% on the main split), Gemini‑2.5‑Pro (91.9%) — are strong but not flawless. Even 4–8% error rates are unacceptable for regulatory filings.
  • Mid‑tier — GPT‑4.1 (89.2%), GPT‑4.1‑mini (88.2%) — respectable but notably weaker.
  • Lower‑tier — many open‑source models dip below 50%, with some failing outright on complex scenarios.

The accuracy drop from Direct Lookup to Multivariate Calculation is stark. Many models collapse to ~0% in the hardest category, underscoring how reasoning complexity is the primary driver of hallucinations.

The Scale Problem — and Why It’s Fixable

One repeat offender: scale errors — correct numbers with the wrong magnitude ($150 vs $150M). For Llama‑3.3‑70B, correcting scale errors alone would have raised value accuracy from 37.0% to 57.7%. This is low‑hanging fruit for model alignment and post‑processing.
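A hedged sketch of what such a post-processing check might look like: treat a prediction that matches the reference once rescaled by a standard reporting magnitude (thousands, millions, billions) as a scale error rather than an outright wrong value. The function name and tolerance are assumptions, not FAITH's scoring code.

```python
SCALE_FACTORS = (1e3, 1e6, 1e9)

def classify_numeric_error(predicted: float, reference: float,
                           rel_tol: float = 1e-4) -> str:
    """Return 'correct', 'scale_error', or 'value_error'."""
    if reference == 0:
        return "correct" if predicted == 0 else "value_error"
    if abs(predicted - reference) / abs(reference) <= rel_tol:
        return "correct"
    # Check whether the digits are right but the magnitude is off by a
    # standard reporting scale in either direction.
    for factor in SCALE_FACTORS:
        for candidate in (predicted * factor, predicted / factor):
            if abs(candidate - reference) / abs(reference) <= rel_tol:
                return "scale_error"
    return "value_error"

# Example mirroring the $150 vs $150M case above.
print(classify_numeric_error(150.0, 150_000_000.0))  # -> 'scale_error'
```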

When Tables and Text Must Talk

FAITH’s most telling case study involved a masked $20.2M equity investment figure. Only Gemini‑2.5‑Pro pieced it together, integrating:

  • Text describing a 90% purchase of Mohawk Commons for $62.1M.
  • Table data showing 18.1% ownership and $7.2M debt.

It reverse‑engineered total debt, computed equity, applied ownership share, and nailed the number. Other models skipped the debt step, inflating the figure — strong evidence that latent‑variable inference remains a frontier problem.

Implications for Finance Teams

  • Model Selection Matters — Small gaps in benchmark scores translate into big differences in live deployments.
  • Test for Your Context — Even high‑accuracy models can fail catastrophically on your firm’s specific table formats.
  • Targeted Error Mitigation — Post‑hoc checks for scale errors and missing‑variable reasoning could slash hallucination risk.
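As one concrete instance of such a check, a minimal reconciliation sketch: recompute a derived metric from its reported components and flag answers that don't tie out. The metric, figures, and tolerance below are illustrative assumptions, not a prescribed workflow.

```python
def reconciles(reported_value: float, components: dict,
               formula, rel_tol: float = 0.01) -> bool:
    """True if the reported value matches the value recomputed from components."""
    recomputed = formula(**components)
    return abs(reported_value - recomputed) <= rel_tol * abs(recomputed)

# Gross profit should equal revenue minus cost of revenue (hypothetical figures).
ok = reconciles(
    reported_value=450.0,
    components={"revenue": 1_250.0, "cost_of_revenue": 800.0},
    formula=lambda revenue, cost_of_revenue: revenue - cost_of_revenue,
)
print(ok)  # -> True; a False here would route the answer for human review
```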

FAITH isn’t just another benchmark — it’s a mirror held up to LLMs’ numerical integrity. In high‑stakes domains, it’s a reminder: Trust, but verify.


Cognaptus: Automate the Present, Incubate the Future