Mind the Earnings Gap: Why LLMs Still Flunk Financial Decision-Making

TL;DR for operators

A financial AI system does not fail only when it invents a company, misreads a filing, or forgets what EBITDA means. Those are the obvious failures. FinanceBench is more useful because it exposes the quieter failure mode: the model has access to the document, produces a coherent answer, and still gets the financial question wrong.¹

The operational lesson is not “never use LLMs in finance.” That would be dramatic, and finance already has enough theatre. The lesson is narrower and more expensive: financial copilots need retrieval governance, numerical verification, source-level evidence trails, refusal discipline, and workflow boundaries before they are allowed anywhere near analyst-facing decisions.

FinanceBench is a minimum bar. It asks clear questions about public filings and checks whether models can answer with evidence. If a model cannot reliably answer those questions, it should not be promoted into a decision-making agent, no matter how nicely it formats the memo.

The spreadsheet is not the hard part

Picture a familiar finance workflow. Someone asks for a quick answer before a committee meeting: Did this company’s operating margin improve? Was depreciation included correctly? Is the answer visible in the 10-K or are we inferring? A junior analyst opens the filing, finds the line item, checks the unit, performs the calculation, writes the answer, and cites the page.

This is not glamorous work. It is also exactly where many financial AI systems start. Before the model is asked to recommend a stock, rebalance a portfolio, or detect strategic risk, it must survive the dull machinery of evidence-backed financial question answering. The dull machinery is where money quietly escapes.

FinanceBench focuses on that machinery. It is not a trading contest, not a market sentiment toy, and not a test of whether a model can sound like a hedge fund intern after two espressos. It evaluates open-book financial QA over public company documents: 10-Ks, 10-Qs, 8-Ks, and earnings reports. The benchmark contains 10,231 question-answer-evidence triplets covering 40 US public companies and 361 filings from 2015 to 2023.¹

That design matters. The questions are not meant to be impossible riddles. They are deliberately close to the ordinary work of financial analysis: extract figures, compare periods, compute ratios, interpret a disclosed fact, and justify the result using evidence. In other words, FinanceBench is not asking the model to be Warren Buffett. It is asking the model to read the filing properly. A modest request, apparently still too emotionally demanding.

FinanceBench tests a minimum standard, not a grand theory of markets

The most common misconception is to treat FinanceBench as a verdict on whether LLMs can “do finance.” It is not. It tests a narrower capability: can a model answer clear financial questions when the answer exists in public filings and should be supported by evidence?

That narrowness is the point. Financial decision-making has layers. At the bottom is factual competence: find the right document, retrieve the right evidence, perform the right calculation, and state the answer with the right unit. Above that sits interpretation: understand whether the number matters. Above that sits decision-making: act under uncertainty, opportunity cost, risk appetite, and timing pressure.

FinanceBench lives mostly at the first layer, with some movement into numerical and logical reasoning. Its question set includes domain-relevant questions, novel analyst-style questions, and metrics-generated questions. Among the domain-relevant and metrics-generated questions that the authors taxonomised, 28% involved information extraction only, 66% involved numerical reasoning, and 6% involved logical reasoning.¹ That distribution is telling. The benchmark is not just “find a sentence.” It repeatedly asks the model to turn disclosed financial data into a usable answer.

This makes FinanceBench more commercially relevant than a leaderboard built from generic financial trivia. Enterprise users do not usually ask, “What is a bond?” They ask, “Using the 2022 annual report, what was Boeing’s cost of goods sold in USD millions?” The difference is not cosmetic. The first question tests vocabulary. The second tests document grounding, units, table parsing, and calculation discipline. Finance tends to prefer the second. It has invoices to pay.

The performance gap is a systems gap

The headline result is uncomfortable because it separates model intelligence from deployment design.

In the closed-book setting, GPT-4-Turbo answered only 9% of the 150 evaluation questions correctly. That poor result is not surprising; asking a model to recall filing-specific facts without the filing is less “AI strategy” than “expensive guessing.” The useful comparison comes from the open-book configurations.

A shared vector store across filings produced weak results. GPT-4-Turbo with a shared vector store answered 19% of questions correctly, answered 13% incorrectly, and failed to answer 68%. With a single vector store per filing, GPT-4-Turbo rose to 50% correct. With the relevant filing placed into a long context window, it reached 79% correct. In an unrealistic oracle setup, where the model received the evidence pages needed for the answer, it reached 85% correct.¹

Those numbers should change how operators think about the problem. The gap between 19% and 79% is not simply a “model quality” gap. It is a retrieval and context-management gap. The model performs very differently depending on whether the evidence is retrieved, how much irrelevant material surrounds it, and where the question appears relative to the context.

| Configuration pattern | What the paper directly shows | Cognaptus interpretation | Business boundary |
| --- | --- | --- | --- |
| Closed-book answering | Filing-specific QA collapses without evidence access | Do not use foundation models as memory engines for company facts | Some general company facts may be known, but audit-grade answers need sources |
| Shared vector store | Retrieval over many filings can fail badly | Enterprise RAG is not automatically safer because it has more documents | Index quality, chunking, metadata, and retrieval evaluation are core controls |
| Single-document retrieval | Narrower retrieval improves performance | Scope control is a practical reliability lever | It assumes the right document has already been selected |
| Long context | Full-filing access improves results | Long context can bypass some retrieval failures | It increases cost, latency, and attention-management problems |
| Oracle evidence | Even with evidence pages, errors remain | Retrieval is necessary but not sufficient; reasoning still fails | This is not a realistic operating configuration |

The oracle result deserves special attention. If the model receives the relevant evidence pages and still misses 15% of answers, the failure is not only retrieval. Some errors occur after the right material is already available. That is the part vendors prefer to discuss later, preferably after procurement.
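The arithmetic behind these comparisons is simple enough to write down. The sketch below uses only the percentages quoted in this section; the `Outcome` class and the variable names are illustrative assumptions, not anything from the paper's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    correct: int    # percent of questions answered correctly
    incorrect: int  # percent answered incorrectly
    refused: int    # percent where the model failed to answer

    def precision_when_answering(self) -> float:
        """Share of attempted answers that were correct."""
        answered = self.correct + self.incorrect
        return self.correct / answered if answered else 0.0


# GPT-4-Turbo with a shared vector store, as quoted in the text.
shared_store = Outcome(correct=19, incorrect=13, refused=68)

# Percent correct per configuration, as quoted in the text.
correct_rate = {
    "closed_book": 9,
    "shared_store": 19,
    "single_store": 50,
    "long_context": 79,
    "oracle": 85,
}

# Points gained by changing how evidence reaches the model, not the model.
context_gap = correct_rate["long_context"] - correct_rate["shared_store"]
# Errors that remain even when the evidence pages are handed over.
oracle_residual = 100 - correct_rate["oracle"]

print(f"context gap: {context_gap} points")
print(f"oracle residual: {oracle_residual} points")
print(f"precision when answering: {shared_store.precision_when_answering():.0%}")
```

Splitting refusals from wrong answers is the useful part: in the shared-store setup, the model declined most questions, and roughly four in ten of the answers it did give were wrong.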

Retrieval helps only when it retrieves the right thing

FinanceBench is quietly brutal toward naive RAG. Retrieval-augmented generation is often sold as if attaching a vector database to a language model transforms it into an evidence-governed analyst. The paper says: charming idea, please test it.

The shared vector-store setup is closer to how enterprises are tempted to deploy: put the filings in an index, ask questions, let embeddings do their magic, add a citation-looking output, and hope the board never asks for the audit trail. FinanceBench shows why that hope is not an operating control. A model can fail because the retriever pulls the wrong filing section, because a relevant table is chunked badly, because units are split from values, or because the question requires multiple statements rather than one convenient paragraph.
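One of those failure modes, units split from values, has a cheap mitigation: repeat the table caption and header in every chunk. The following is a minimal sketch under that assumption; `chunk_table_rows` is a hypothetical helper and the table contents are invented for illustration, not taken from any filing.

```python
def chunk_table_rows(caption: str, header: list[str], rows: list[list[str]]) -> list[str]:
    """Emit one chunk per table row, repeating the caption and column
    names so that units and periods travel with every value."""
    chunks = []
    for row in rows:
        cells = "; ".join(f"{col}: {val}" for col, val in zip(header, row))
        chunks.append(f"{caption} | {cells}")
    return chunks


# Hypothetical table; the figures are placeholders.
caption = "Consolidated Statements of Operations (USD millions)"
header = ["Line item", "FY2022", "FY2021"]
rows = [
    ["Revenue", "1,200", "1,000"],
    ["Cost of goods sold", "800", "700"],
]

chunks = chunk_table_rows(caption, header, rows)
for chunk in chunks:
    print(chunk)
```

Each chunk is now self-describing, so a retriever that returns only the cost-of-goods-sold row still returns the unit and the fiscal years with it.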

The distinction between shared and single vector stores is especially important. A single-document index is much easier: the model searches inside one known filing. A shared index forces the system to retrieve across many documents. That is closer to real enterprise use, where analysts ask across issuers, periods, amended filings, and source types. The benchmark shows that expanding the document universe can make the model less useful, not more. More context is not more knowledge when the system cannot rank evidence.
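A common way to recover some of the single-index advantage inside a shared index is to filter on filing metadata before ranking. The sketch below assumes that approach; it is not the paper's retrieval setup. The `Chunk` fields, the tickers, and the token-overlap score (a stand-in for embedding similarity) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    ticker: str
    form: str
    fiscal_year: int
    text: str


def overlap_score(question: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase tokens.
    A real system would use embeddings or a tuned ranker here."""
    return len(set(question.lower().split()) & set(text.lower().split()))


def retrieve(question: str, chunks: list[Chunk], *, ticker: str, form: str,
             fiscal_year: int, k: int = 3) -> list[Chunk]:
    """Scope the search to one filing first, then rank inside that scope."""
    scoped = [
        c for c in chunks
        if (c.ticker, c.form, c.fiscal_year) == (ticker, form, fiscal_year)
    ]
    ranked = sorted(scoped, key=lambda c: overlap_score(question, c.text), reverse=True)
    return ranked[:k]


# Hypothetical shared index: near-identical text across issuers and years.
index = [
    Chunk("AAA", "10-K", 2022, "cost of goods sold was 800 in USD millions"),
    Chunk("BBB", "10-K", 2022, "cost of goods sold was 450 in USD millions"),
    Chunk("AAA", "10-K", 2021, "cost of goods sold was 700 in USD millions"),
]

hits = retrieve("what was cost of goods sold", index,
                ticker="AAA", form="10-K", fiscal_year=2022)
print(hits[0].text)
```

Without the filter, all three chunks score identically against the question, which is exactly the ranking problem a shared index creates. The filter only helps if document selection upstream is itself correct.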

This is where the business implication becomes concrete. The ROI of a financial copilot is not measured by whether it can answer a demo question. It is measured by how often it retrieves the right evidence under messy document conditions, how quickly reviewers can verify it, and how cheaply errors are caught before they enter reports, credit memos, valuation decks, or client communications.

Long context is a workaround, not a governance model

The long-context results are the most tempting. GPT-4-Turbo with the filing in context reached 79% correct, and Claude2 reached 76%. That is much better than the vector-store configurations. It suggests that, for some financial QA tasks, brute-force context loading can rescue the model from retrieval failure.

But long context is not a free lunch. It is lunch with a service charge, latency, and a suspiciously vague invoice.

FinanceBench notes practical constraints: public filings can run hundreds of pages, long-context models are slower and more expensive, and context windows still may not fit everything. Even when the full or truncated filing is included, the model must still attend to the right section, preserve the question, and avoid being distracted by nearby but irrelevant financial text.

The prompt-order result makes this painfully clear. In the long-context setting, placing the filing before the question produced much stronger performance than the reverse order. When the question was moved ahead of the long evidence block, GPT-4-Turbo dropped from 78% to 25% correct, and Claude2 from 76% to 37%.¹ This is not a minor prompt-engineering footnote. It means financial QA performance can swing violently because of context layout.

For operators, that turns prompt design into process design. The system must standardise context order, preserve the user’s question near the answer-generation step, and test degradation as documents lengthen. Otherwise, the AI assistant is less an analyst and more a filing shredder with a pleasant tone.
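Standardising context order can be as plain as routing every request through one prompt builder. This is a minimal sketch, assuming a simple tagged layout; the `build_prompt` function, the `<filing>` tags, and the instruction wording are illustrative choices, not the prompts used in the paper.

```python
def build_prompt(filing_text: str, question: str) -> str:
    """Place the filing first and the question last, so the question
    sits next to the answer-generation step."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    return (
        "You are answering a question about the filing below.\n\n"
        f"<filing>\n{filing_text}\n</filing>\n\n"
        f"Question: {question}\n"
        "Answer using only the filing, and cite the page you relied on."
    )


prompt = build_prompt("Hypothetical filing text.", "What was FY2022 revenue?")
print(prompt)
```

Keeping the layout in one function also gives the team one place to test: feed progressively longer filings through it and track how accuracy degrades.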

A wrong answer is worse than a refusal

FinanceBench separates incorrect answers from failures to answer. That distinction matters more in finance than in casual search.

A refusal can be annoying. It wastes time. It may force a human to open the document. But a plausible incorrect answer contaminates downstream judgement. It can enter a memo, feed a valuation model, justify a covenant interpretation, or produce a client-facing explanation that looks properly sourced until someone checks the line item.

The paper’s qualitative analysis found that models sometimes produced coherent, well-justified answers that were still wrong. Some errors were calculation mistakes. Others involved wrong units, wrong directionality, outdated assumptions, or invented legal information. The dangerous feature is not stupidity. It is fluent wrongness with paperwork attached.

This creates a practical design rule: in financial workflows, the model should be rewarded for calibrated abstention. A system that refuses when evidence is weak may feel less magical, but it is often safer than one that fills gaps with decorative confidence. Financial automation should optimise for verified correctness, not conversational completion.
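In code, calibrated abstention can start as a gate between the draft answer and the user. The sketch below assumes the pipeline already produces an evidence score between 0 and 1; the `Draft` fields, the `gate` function, and the 0.7 threshold are hypothetical and would need tuning against labelled data.

```python
from dataclasses import dataclass
from typing import Optional

REFUSAL = "Insufficient evidence in the retrieved filing to answer."


@dataclass(frozen=True)
class Draft:
    answer: str
    evidence_page: Optional[int]  # page the answer cites, if any
    evidence_score: float         # assumed retriever or verifier score in [0, 1]


def gate(draft: Draft, min_score: float = 0.7) -> str:
    """Release the answer only when it cites a page and the evidence
    score clears the threshold; otherwise refuse explicitly."""
    if draft.evidence_page is None or draft.evidence_score < min_score:
        return REFUSAL
    return f"{draft.answer} (source: p. {draft.evidence_page})"


print(gate(Draft("USD 800 million", evidence_page=45, evidence_score=0.9)))
print(gate(Draft("USD 800 million", evidence_page=None, evidence_score=0.95)))
```

The gate does not make the model better calibrated. It makes the refusal path explicit, which is what lets refusal quality be measured and reviewed.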

That also changes evaluation. Accuracy alone is not enough. Operators should measure:

retrieval hit rate against gold evidence;
answer correctness;
unit and period correctness;
calculation reproducibility;
citation faithfulness;
refusal quality;
reviewer correction time;
error severity by downstream use case.
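Several of these measures fall out of a single pass over labelled evaluation records. The sketch below covers four of them under an assumed record schema (`label`, `gold_page`, `retrieved_pages`); the field names and sample records are hypothetical.

```python
from collections import Counter


def summarise(records: list[dict]) -> dict[str, float]:
    """Compute outcome rates and retrieval hit rate from labelled records.
    Each record carries a label ("correct", "incorrect" or "refused"),
    the gold evidence page, and the pages the retriever returned."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarise")
    labels = Counter(r["label"] for r in records)
    hits = sum(1 for r in records if r["gold_page"] in r["retrieved_pages"])
    return {
        "correct_rate": labels["correct"] / n,
        "incorrect_rate": labels["incorrect"] / n,
        "refusal_rate": labels["refused"] / n,
        "retrieval_hit_rate": hits / n,
    }


# Hypothetical evaluation records.
records = [
    {"label": "correct", "gold_page": 45, "retrieved_pages": [44, 45]},
    {"label": "incorrect", "gold_page": 12, "retrieved_pages": [12]},
    {"label": "refused", "gold_page": 7, "retrieved_pages": [90]},
    {"label": "correct", "gold_page": 3, "retrieved_pages": [3, 4]},
]

report = summarise(records)
print(report)
```

The second record is the interesting one: retrieval succeeded and the answer was still wrong, which is the oracle-style failure that a single accuracy number hides.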

A wrong answer about a company’s reported inventory is not the same as a wrong answer about a debt covenant, revenue recognition, or liquidity position. FinanceBench gives the starting point. Enterprises still need severity-weighted evaluation.
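A severity-weighted score can be sketched in a few lines. The weights below are placeholders chosen for illustration, not a recommended scale; a real deployment would set them with the risk owners for each downstream use case.

```python
# Hypothetical severity weights per topic; higher means costlier errors.
SEVERITY = {
    "inventory": 1.0,
    "revenue_recognition": 3.0,
    "liquidity": 4.0,
    "debt_covenant": 5.0,
}


def severity_weighted_error(results: list[tuple[str, bool]]) -> float:
    """Share of total severity weight lost to wrong answers
    (0.0 means no weighted error, 1.0 means everything was wrong)."""
    total = sum(SEVERITY[topic] for topic, _ in results)
    lost = sum(SEVERITY[topic] for topic, ok in results if not ok)
    return lost / total if total else 0.0


# Two runs with the same raw error rate (one wrong answer in four).
run_a = [("inventory", False), ("debt_covenant", True),
         ("liquidity", True), ("revenue_recognition", True)]
run_b = [("inventory", True), ("debt_covenant", False),
         ("liquidity", True), ("revenue_recognition", True)]

score_a = severity_weighted_error(run_a)
score_b = severity_weighted_error(run_b)
print(f"run A: {score_a:.3f}, run B: {score_b:.3f}")
```

Both runs score 75% on plain accuracy, but the run that misses the covenant question carries five times the weighted error of the run that misses the inventory question.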

The paper is not anti-LLM; it is anti-handwaving

A useful reading of FinanceBench should avoid the lazy conclusion that LLMs are useless in finance. Other research points in a more nuanced direction. Kim, Muhn, and Nikolaev find that GPT-4 can generate useful financial-statement analysis from standardised, anonymised statements and, in their setting, can outperform professional analysts in predicting the direction of future earnings changes.² That does not contradict FinanceBench. It clarifies the boundary.

In the financial-statement analysis study, the model receives a structured analytical task with standardised inputs. In FinanceBench, the model must locate evidence in public filings and answer grounded questions. One tests analytical pattern recognition over prepared financial statements. The other tests document-grounded QA under retrieval and evidence constraints.

Both are relevant. Together they suggest that LLMs may have genuine financial reasoning capacity, but that capacity is easily throttled by document plumbing, retrieval design, context order, and verification gaps. The model may be capable of useful inference once the data is cleanly presented. The enterprise system still has to get the data cleanly presented. Minor detail. Only the whole job.

This is also why newer financial-agent benchmarks such as InvestorBench push beyond static QA toward sequential financial decision-making across stocks, cryptocurrencies, and ETFs.³ That expansion is necessary, but it should not let teams skip the lower-level QA problem. A trading or portfolio agent that cannot reliably answer filing-grounded questions is not “agentic.” It is merely autonomous in the same way a shopping cart is autonomous when released on a slope.

What Cognaptus would infer for deployment

FinanceBench directly shows model weakness on filing-grounded financial QA. Cognaptus infers a broader deployment lesson: financial AI should be built as an auditable analytical system, not a chat window pointed at a document folder.

That means the system architecture should separate tasks that demos often blur together.

| Layer | Required control | Why it matters |
| --- | --- | --- |
| Document selection | Filing identity, period, amendment status, source provenance | A correct calculation from the wrong filing is still wrong |
| Retrieval | Evidence recall tests, metadata filters, table-aware chunking | Financial answers often depend on exact rows, units, and periods |
| Reasoning | Calculator/tool use, formula templates, intermediate checks | LLM arithmetic remains fragile when values span tables |
| Answer generation | Evidence-linked claims, unit normalisation, uncertainty labels | Polished prose should not hide weak grounding |
| Review | Human escalation, severity scoring, audit logs | Not all errors carry the same business risk |
| Monitoring | Benchmark regression tests across model and index updates | Model upgrades can silently change behaviour |
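The reasoning and answer-generation rows can be made concrete with a formula template that does the arithmetic outside the model. This is a minimal sketch, assuming the model's job is only to extract line items; the `LineItem` type, the unit table, and the figures are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal

# Assumed unit labels; extend to match the filings actually in scope.
UNIT_SCALE = {
    "USD": Decimal(1),
    "USD thousands": Decimal(1_000),
    "USD millions": Decimal(1_000_000),
}


@dataclass(frozen=True)
class LineItem:
    name: str
    value: Decimal
    unit: str
    period: str
    page: int  # evidence page, kept for the audit trail

    def in_usd(self) -> Decimal:
        """Normalise the reported value to plain USD."""
        return self.value * UNIT_SCALE[self.unit]


def operating_margin(operating_income: LineItem, revenue: LineItem) -> Decimal:
    """Formula template with intermediate checks before any division."""
    if operating_income.period != revenue.period:
        raise ValueError("line items come from different periods")
    if revenue.in_usd() <= 0:
        raise ValueError("revenue must be positive")
    return operating_income.in_usd() / revenue.in_usd()


# Hypothetical extracted values reported in different units.
revenue = LineItem("Revenue", Decimal("1200"), "USD millions", "FY2022", 45)
op_income = LineItem("Operating income", Decimal("150000"), "USD thousands", "FY2022", 45)

margin = operating_margin(op_income, revenue)
print(f"operating margin: {margin:.1%} (pp. {op_income.page}, {revenue.page})")
```

The checks are deliberately dull: mismatched periods and mixed units are refused or normalised before a number is produced, and the page references ride along so a reviewer can verify the inputs.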

The point is not to smother the model in process until it becomes a very expensive PDF search bar. The point is to place the model where it creates leverage: drafting evidence-backed explanations, summarising retrieved material, comparing disclosed metrics, and surfacing candidate answers for review.

The human analyst should not be replaced by a model that occasionally remembers how percentages work. The better target is analyst throughput: faster evidence location, fewer manual extraction steps, cleaner first drafts, and more systematic checks. That is still valuable. It is simply less theatrical than “AI replaces Wall Street,” which is unfortunate for conference panels but healthier for balance sheets.

Boundaries that matter before anyone overgeneralises

FinanceBench has limits, and they affect how the results should be used.

First, it is a single-turn QA benchmark. Real analysts ask follow-up questions, challenge assumptions, refine metrics, and compare companies. A system that performs well on FinanceBench may still fail in multi-turn analysis if it loses context or compounds earlier mistakes.

Second, it focuses on public filings from public companies. That is appropriate for reproducibility, but it does not cover private company diligence, management data rooms, broker research, channel checks, or messy internal spreadsheets. In private markets, the source material is often less standardised and less audited. The model does not become more reliable because the documents become worse. Funny how that works.

Third, FinanceBench is not a portfolio benchmark. It does not evaluate transaction costs, market impact, risk budgeting, timing, factor exposure, or behavioural feedback loops. Work such as DeepFund argues that live, leakage-resistant benchmarking is needed for investment systems because historical backtests can let models “time travel” through pretraining exposure.⁴ That is a different problem from filing QA, but the family resemblance is clear: financial AI needs evaluation environments that punish shortcuts.

Fourth, benchmark progress should not be mistaken for deployment readiness. FLaME, a 2025 financial language-model evaluation suite, reflects the field’s move toward broader, standardised, multi-task assessment across financial NLP tasks.⁵ That is welcome. But enterprise systems still need use-case-specific validation. A model that performs well on sentiment classification may still mishandle a debt maturity table. A model that summarises an earnings report may still calculate a margin incorrectly. Finance is rude like that.

The real benchmark is the workflow

FinanceBench is valuable because it turns a fashionable question into an operational one. The fashionable question is: Can LLMs do financial analysis? The operational question is: Can this system answer this financial question, from this filing, with this evidence, under this latency and review constraint, at an error rate we can tolerate?

That second question is less glamorous, which is usually a sign that it is closer to the truth.

The right takeaway is not pessimism. It is discipline. LLMs can already help with financial workflows when the task is scoped, the evidence is controlled, the calculations are checked, and the output is reviewed. They become dangerous when fluency is mistaken for verification, when retrieval is treated as solved, and when refusal is punished because someone wants the dashboard to look clever.

FinanceBench does not say the financial AI project is dead. It says the demo is not the control environment. In finance, that distinction is not academic. It is the difference between a useful analyst copilot and a confident intern with access to every filing and no instinct for when to stop talking.

Cognaptus: Automate the Present, Incubate the Future.

¹ Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen, “FinanceBench: A New Benchmark for Financial Question Answering,” arXiv:2311.11944, 2023, https://arxiv.org/abs/2311.11944.
² Alex G. Kim, Maximilian Muhn, and Valeri V. Nikolaev, “Financial Statement Analysis with Large Language Models,” arXiv:2407.17866, revised draft, 2024, https://arxiv.org/abs/2407.17866.
³ Haohang Li et al., “INVESTORBENCH: A Benchmark for Financial Decision-Making Tasks with LLM-based Agent,” arXiv:2412.18174, 2024, https://arxiv.org/abs/2412.18174.
⁴ “Time Travel is Cheating: Going Live with DeepFund for Real-Time Fund Investment Benchmarking,” arXiv:2505.11065, 2025, https://arxiv.org/abs/2505.11065.
⁵ Glenn Matlin, Mika Okamoto, Huzaifa Pardawala, Yang Yang, and Sudheer Chava, “Finance Language Model Evaluation (FLaME),” arXiv:2506.15846, 2025, https://arxiv.org/abs/2506.15846.
