Show Me the Money (Reasoning): Benchmarking Financial Intelligence in LLMs

Money has a useful habit: it exposes nonsense quickly.

In ordinary chatbot use, a slightly wrong answer may be annoying. In financial analysis, a slightly wrong number can change a valuation, distort a risk view, or make a portfolio note look more confident than it deserves. That is why financial AI is not just another “domain application” of large language models. It is a stress test for whether a model can combine facts, time, arithmetic, business context, and restraint without pretending that a polished paragraph is the same as a verified conclusion.

The paper behind today’s article introduces the AI Financial Intelligence Benchmark, or AFIB, and uses it to compare GPT, Gemini, Perplexity, Claude, and SuperInvesting on financial-analysis tasks centered on Indian equities.¹ The benchmark asks a practical question: which AI systems can behave like useful investment-research assistants, not just fluent explainers of finance vocabulary?

The answer is not “the model with web search wins.” Nor is it “the smartest general-purpose model wins.” The paper’s more useful finding is messier: live retrieval helps with recency; reasoning-oriented systems help with analytical depth; conservative refusal can reduce hallucination while also reducing usefulness; and domain-specific systems may perform better when structured financial data is connected to reasoning logic.

That is less convenient than a leaderboard. It is also closer to how actual research desks work.

The wrong question is “which model is best?”

Buyers of financial AI usually phrase their question as a ranking problem: which model should we use?

That question is too crude. It treats “financial intelligence” as a single capability, as if a model that remembers the latest earnings announcement must also be good at valuation reasoning, or a model that explains weighted average cost of capital must also know this quarter’s asset-quality data. Finance refuses to be that tidy.

AFIB evaluates financial AI across several practical dimensions: factual accuracy, hallucination resistance, analytical depth, completeness, recency, consistency, and failure behavior. The paper describes five core evaluation dimensions in the methodology, while the figures often display six normalized capability dimensions by separating related concepts such as accuracy and hallucination resistance. That labeling is slightly untidy, but the underlying point is clear enough: financial AI must be judged as a portfolio of capabilities, not as a single score wearing a suit.

The benchmark uses structured financial-analysis questions based on real-world equity research tasks. The abstract describes the dataset as “95+” structured questions, while the main text later refers to 71 financial-analysis queries. Rather than force false precision onto that count, the safest reading is that the controlled benchmark contains a curated set of structured financial prompts, supplemented by 432 negatively rated production responses from a deployed financial AI system. That second dataset matters because it moves the paper beyond classroom-style testing into observed failure modes. Production users are very good at finding edge cases. Unfortunately.

AFIB tests the analyst workflow, not just financial vocabulary

Many finance benchmarks ask whether a model can answer questions from filings or extract figures from tables. Useful, yes. Sufficient, no.

An analyst workflow is not just retrieval. It usually has to meet at least five requirements:

Workflow requirement	What can go wrong in an LLM answer	Why it matters
Retrieve the right financial period	The model uses last year’s figures or mixes quarters	The analysis may be stale before it begins
Reconcile numerical claims	Segment totals do not match consolidated figures	The story becomes mathematically false
Interpret business drivers	The answer lists ratios without explaining causes	The output is descriptive, not analytical
Handle uncertainty	The model invents precision where data is incomplete	False confidence enters the decision process
Repeat reliably	The model changes numbers or conclusions across sessions	Research workflows lose reproducibility

AFIB is interesting because it tries to measure several of these requirements together. It does not merely ask whether the model “knows finance.” It asks whether the model can assemble a defensible answer under conditions that resemble investment research: recent data, multiple metrics, business interpretation, and repeated prompts.

That is why the benchmark’s comparison structure is more valuable than its headline ranking. A procurement team or product manager should not read the paper as “buy the top model.” They should read it as a capability map.

The leaderboard is useful, but the spread matters more than the order

The overall benchmark result ranks SuperInvesting first, followed by Gemini, Perplexity, GPT, and Claude. In the paper’s composite leaderboard, the normalized scores are:

Model/system	Composite score
SuperInvesting	84.1
Gemini	72.3
Perplexity	61.5
GPT	47.1
Claude	46.3

A lazy reading would stop here. SuperInvesting wins, Gemini follows, Perplexity is mid-table, GPT and Claude trail. Done. Add a green checkmark, publish the vendor slide, and quietly hope nobody asks about methodology.

The more useful reading is that the models fail in different ways.

SuperInvesting performs strongly across most categories, particularly accuracy, completeness, analytical depth, consistency, and hallucination resistance. Gemini performs well on reasoning-heavy tasks but is weaker on recency. Perplexity leads on recency because retrieval is its architectural advantage, but it performs less well on analytical depth and consistency. Claude shows high hallucination resistance, but partly because it is more conservative and less complete. GPT shows moderate reasoning ability but weaker numerical reliability.

This matters because failure mode is not a technical footnote. It determines workflow design. A model that refuses too often slows analysts down. A model that retrieves fresh data but cannot synthesize it creates a research intern with a browser and no judgment. A model that confidently fabricates numbers is worse: it creates an illusion of due diligence.

The spreadsheet may forgive many things. It does not forgive invented denominators.
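To make the "spread" point concrete, here is a minimal sketch that computes the gap between adjacent ranks from the composite scores reported in the table above. The function name `adjacent_gaps` is illustrative and does not come from the paper.

```python
# Composite scores as reported in the leaderboard table above.
composite = {
    "SuperInvesting": 84.1,
    "Gemini": 72.3,
    "Perplexity": 61.5,
    "GPT": 47.1,
    "Claude": 46.3,
}


def adjacent_gaps(scores):
    """Return (higher, lower, gap) for each adjacent pair in rank order."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (a, b, round(score_a - score_b, 1))
        for (a, score_a), (b, score_b) in zip(ranked, ranked[1:])
    ]


gaps = adjacent_gaps(composite)
for higher, lower, gap in gaps:
    print(f"{higher} -> {lower}: {gap} points")
```

GPT and Claude sit 0.8 points apart, while Perplexity and GPT are separated by 14.4. The article's tables carry no error bars, so whether a 0.8-point gap means anything is an open question. That is one more reason to read the failure profiles rather than the order.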

Retrieval wins recency, but recency is not analysis

The paper’s most business-relevant comparison is the trade-off between recency and analytical depth.

Perplexity performs best on data recency. That is unsurprising because retrieval-oriented systems are built to access current information. In market-facing workflows, this is a real advantage. A model that does not know the latest quarter, policy decision, or earnings release can reason beautifully about the wrong world. Very elegant, very useless.

But the benchmark also shows that strong recency does not automatically produce strong financial reasoning. Perplexity’s recency score is high, while its analytical-depth score is much lower. Gemini shows the opposite pattern: stronger analytical depth, weaker recency. SuperInvesting appears as a partial exception, with relatively strong performance on both.

This is the paper’s central architectural lesson. Retrieval and reasoning are complementary, not interchangeable.

System pattern	Strength	Weakness	Practical design implication
Retrieval-first	Current information and source discovery	Shallow synthesis, inconsistent aggregation	Needs financial reasoning layer and validation logic
Reasoning-first	Structured interpretation and valuation logic	Stale data risk	Needs trusted data access and timestamp discipline
Conservative safety-first	Lower fabrication risk	Refusal and incomplete coverage	Needs task-specific answer policies
Structured finance hybrid	Better balance across data and reasoning	Requires domain data engineering	More plausible architecture for research copilots

For business users, this distinction should shape product design. A financial AI assistant should not simply “turn on web browsing” and call itself investment-grade. Current data must be normalized, verified, timestamped, and then passed into reasoning workflows that understand financial relationships.

Otherwise the system is just a search engine with better manners.
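Timestamp discipline can start small. Below is a minimal sketch of a freshness gate, assuming each figure carries the end date of its reporting period. The function name `is_stale` and the 120-day threshold are illustrative assumptions, not values from the paper.

```python
from datetime import date


def is_stale(period_end: date, as_of: date, max_age_days: int = 120) -> bool:
    """Flag a figure whose reporting period ended more than max_age_days before as_of.

    The 120-day default is an arbitrary placeholder; a real desk would set it
    per data type (quarterly results, annual reports, daily prices).
    """
    if period_end > as_of:
        raise ValueError("period_end is after as_of")
    return (as_of - period_end).days > max_age_days


# Dates below are made up for illustration.
print(is_stale(date(2025, 3, 31), as_of=date(2025, 6, 30)))
print(is_stale(date(2024, 3, 31), as_of=date(2025, 6, 30)))
```

A gate like this does not make an answer correct. It only stops a reasoning step from quietly consuming last year's figures.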

Hallucination is not one problem; it has several accounting treatments

The paper’s hallucination analysis is especially useful because finance has a special class of hallucination: fabricated numerical authority.

Counting hallucinated numerical values, GPT produces the most, followed by Gemini, Perplexity, Claude, and SuperInvesting. The figure reports approximately:

Model/system	Hallucinated numerical values
GPT	25
Gemini	12
Perplexity	7
Claude	5
SuperInvesting	3

But the more interesting result is not just the count. It is the behavior behind the count.

Claude shows strong hallucination resistance, but the paper notes that this partly comes from conservative refusal. That is safer than fabrication, but not the same as financial competence. A calculator that refuses to divide by zero is safe; a calculator that refuses ordinary division is a paperweight with compliance training.

This distinction matters for governance. In financial AI systems, “low hallucination” can mean at least three different things:

Low-hallucination behavior	What it means	Business interpretation
Verified correctness	The system retrieves and reasons from accurate data	Strongest form of reliability
Conservative refusal	The system avoids answering when uncertain	Safe but may reduce productivity
Vague non-commitment	The system avoids numerical claims	Low apparent risk, low analytical value

A benchmark that only counts hallucinations may accidentally reward refusal. AFIB partially handles this by also scoring completeness and analytical depth. That is the right direction. In production, reliability should not mean “never answer.” It should mean “answer when grounded, show uncertainty when needed, and refuse only when the task cannot be supported.”

The difference is small in wording and large in operational value.
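A small sketch shows why counting hallucinations alone can reward refusal. It assumes each benchmark answer has already been labeled `correct`, `hallucinated`, or `refused`. Those labels, the function `summarize`, and both outcome lists are hypothetical. This is not AFIB's scoring formula.

```python
from collections import Counter


def summarize(outcomes):
    """Summarize outcomes labeled 'correct', 'hallucinated', or 'refused'."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no outcomes to summarize")
    counts = Counter(outcomes)
    return {
        "hallucination_rate": counts["hallucinated"] / total,
        "coverage": (total - counts["refused"]) / total,
        "verified_rate": counts["correct"] / total,
    }


# Hypothetical outcomes for two imaginary systems, ten questions each.
always_refuses = ["refused"] * 10
answers_mostly = ["correct"] * 7 + ["hallucinated"] * 1 + ["refused"] * 2

print(summarize(always_refuses))
print(summarize(answers_mostly))
```

The system that refuses everything posts a perfect hallucination rate and zero coverage. Reporting the three numbers side by side keeps that trade visible instead of hiding it inside one metric.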

The appendix examples show the benchmark’s real texture

The appendix is not a second thesis. It is a useful diagnostic window into the benchmark.

The examples include prompts about Reliance Industries’ EBITDA contribution, SBI versus ICICI Bank asset quality, Larsen & Toubro weighted fundamental scoring, and Asian Paints’ ROIC-WACC spread. These are not exotic quant tasks. They are ordinary analyst questions with enough structure to punish shallow answers.

The Reliance example is particularly revealing. GPT reports segment percentages of 43%, 28%, and 29%, then incorrectly concludes that the three segments represent 100% of consolidated EBITDA. The arithmetic is neat. The business logic is not. The issue is not that the model cannot add. It is that segment-level financial analysis requires awareness of reporting structure, aggregation, and what is excluded from the chosen categories.

SuperInvesting, by contrast, provides a range-based segment analysis and estimates the combined contribution around 88–90%. Gemini also gives a structured estimate around 88%. Perplexity attempts calculation using annual-report figures but runs into aggregation inconsistency. Claude gives a coherent explanation but uses older data.

That single example captures the whole paper:

GPT: fluent but numerically unsafe.
Gemini: analytically stronger, but not always current.
Perplexity: current-data access, weaker aggregation discipline.
Claude: careful, sometimes too stale or incomplete.
SuperInvesting: strongest structured financial workflow in this benchmark.

The SBI versus ICICI prompt shows a different failure pattern. A model may explain provision coverage ratio correctly while using older or approximate figures. That is a dangerous half-success. The concept is right; the evidence is weak. In finance, conceptual correctness does not rescue stale inputs.

The L&T weighted-score example adds another layer. Some systems produce scores using assumed or sector-average inputs. That may look analytical because the output has a number out of 100. But a weighted score with hypothetical inputs is not a calculation; it is a costume party for a calculation.

This is why AFIB’s design is more useful when read through failure patterns than through rankings alone.
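The weighted-score problem described above has a simple guard: keep the provenance of each input next to its value. The sketch below assumes each factor is tagged as `reported` or `assumed`. The class `ScoreInput`, the factor names, the weights, and the values are all placeholders, not figures from the paper or from any filing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreInput:
    name: str
    value: float   # factor score on a 0-100 scale
    weight: float  # share of the total score
    source: str    # "reported" or "assumed" (hypothetical labels)


def weighted_score(inputs):
    """Return the weighted score and the names of any non-reported inputs."""
    total_weight = sum(i.weight for i in inputs)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    score = sum(i.value * i.weight for i in inputs)
    assumed = [i.name for i in inputs if i.source != "reported"]
    return score, assumed


# Placeholder factors and values for illustration only.
inputs = [
    ScoreInput("profitability", 70.0, 0.5, "reported"),
    ScoreInput("leverage", 60.0, 0.25, "reported"),
    ScoreInput("order_book_growth", 80.0, 0.25, "assumed"),
]
score, assumed = weighted_score(inputs)
label = "indicative only" if assumed else "computed from reported inputs"
print(f"score={score:.1f} ({label}); assumed inputs: {assumed}")
```

The number out of 100 still appears, but it can no longer travel without a note saying which parts of it were guessed.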

The robustness evidence supports patterns, not universal laws

The paper treats the cross-module consistency of rankings as evidence that the results are not driven by a single artifact. SuperInvesting ranks first in four of five benchmark modules. Perplexity leads only in data recency. Gemini is consistently strong in reasoning-oriented categories. Repeated-question experiments show that SuperInvesting and Claude produce more stable responses, while Perplexity varies more, likely because retrieval results fluctuate.

These tests are best understood as robustness and stability checks. They support the claim that the observed capability profiles are not random one-off outcomes inside this benchmark.

They do not prove a universal ranking of financial AI systems.

That boundary is important. The benchmark is focused on Indian equities, structured prompts, and a specific evaluation period during FY2025–26. Model interfaces and capabilities change quickly. Retrieval systems improve. General models gain tool access. Domain products update their data pipelines. A leaderboard in this market ages like milk, just with more API calls.

So the durable contribution is not the exact ranking. The durable contribution is the evaluation logic: financial AI should be compared across factual accuracy, recency, reasoning depth, completeness, consistency, and failure modes.
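A repeated-question check of the kind described above can be approximated in-house. The sketch below assumes the same prompt is run several times and one headline figure is extracted from each response. It uses the coefficient of variation as a stand-in stability measure, because the article does not say how the paper computes consistency. The run values are hypothetical.

```python
from statistics import mean, pstdev


def relative_spread(values):
    """Population standard deviation divided by the absolute mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; relative spread is undefined")
    return pstdev(values) / abs(m)


# Hypothetical figures extracted from five repeated runs of one prompt.
stable_runs = [88.0, 88.0, 88.0, 88.0, 88.0]
drifting_runs = [88.0, 84.0, 92.0, 80.0, 96.0]

print(relative_spread(stable_runs))
print(relative_spread(drifting_runs))
```

A single-number spread will not catch a system that keeps the figure and changes the conclusion, so it complements a read of the responses rather than replacing one.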

What this means for financial AI product design

The paper’s practical lesson is straightforward: do not build financial AI as a single chatbot wrapped around a general model.

A serious financial AI assistant needs at least four layers.

First, it needs a trusted data layer. This includes filings, exchange disclosures, earnings releases, macroeconomic publications, and market data. The system must know not only the value, but the reporting period, source, unit, currency, restatement status, and whether the figure is consolidated, standalone, segment-level, trailing, forward, or adjusted. Details. The little things that prevent expensive nonsense.
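One way to keep those details attached to a number is to make them fields of the record itself. The sketch below is a hypothetical schema. The names `FinancialFact`, `Basis`, `Horizon`, and `like_for_like` are invented for illustration, and the example values are placeholders.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Basis(Enum):
    CONSOLIDATED = "consolidated"
    STANDALONE = "standalone"
    SEGMENT = "segment"


class Horizon(Enum):
    REPORTED = "reported"
    TRAILING = "trailing"
    FORWARD = "forward"


@dataclass(frozen=True)
class FinancialFact:
    metric: str
    value: float
    unit: str                # e.g. "crore" or "percent"
    currency: Optional[str]  # None for ratios
    period_end: date
    source: str
    basis: Basis
    horizon: Horizon
    adjusted: bool = False
    restated: bool = False


def like_for_like(a: FinancialFact, b: FinancialFact) -> bool:
    """True only if two facts share metric, unit, currency, basis, horizon, and adjustment status."""
    key_a = (a.metric, a.unit, a.currency, a.basis, a.horizon, a.adjusted)
    key_b = (b.metric, b.unit, b.currency, b.basis, b.horizon, b.adjusted)
    return key_a == key_b


# Placeholder values; the source strings are not real documents.
group = FinancialFact("ebitda", 100.0, "crore", "INR", date(2025, 3, 31),
                      "placeholder filing", Basis.CONSOLIDATED, Horizon.REPORTED)
unit = FinancialFact("ebitda", 40.0, "crore", "INR", date(2025, 3, 31),
                     "placeholder filing", Basis.SEGMENT, Horizon.REPORTED)
print(like_for_like(group, unit))
```

The comparison fails because one figure is consolidated and the other is segment-level. That mismatch is easy to miss when both arrive as bare numbers in a prompt.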

Second, it needs a calculation and reconciliation layer. Segment totals should be checked. Ratios should be recomputed. Weighted scores should expose inputs and formulas. If the model says three business segments contribute 100% of EBITDA when they do not, the system should catch that before the prose becomes persuasive.
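The Reliance example reduces to a reconciliation check. Shares of 43%, 28%, and 29% sum to 100 by arithmetic, yet the other systems put the three segments at roughly 88–90% of consolidated EBITDA. One way such a result can arise is measuring each segment against the sum of the selected segments rather than against the consolidated figure. That mechanism is an assumption here, not something the paper states. The sketch below recomputes shares against the consolidated total and compares the result with the claim. Function names and amounts are hypothetical.

```python
def shares_of_total(segment_values, consolidated_total):
    """Percentage share of each segment, measured against the consolidated figure."""
    if consolidated_total <= 0:
        raise ValueError("consolidated_total must be positive")
    return {
        name: 100.0 * value / consolidated_total
        for name, value in segment_values.items()
    }


def check_claim(shares, claimed_pct, tolerance_pct=1.0):
    """Compare a claimed combined contribution with the recomputed sum."""
    recomputed = sum(shares.values())
    return {
        "recomputed_pct": round(recomputed, 1),
        "claimed_pct": claimed_pct,
        "consistent": abs(recomputed - claimed_pct) <= tolerance_pct,
    }


# Placeholder amounts in arbitrary units, NOT Reliance's reported figures.
segments = {"segment_a": 40.0, "segment_b": 26.0, "segment_c": 23.0}
consolidated = 100.0

result = check_claim(shares_of_total(segments, consolidated), claimed_pct=100.0)
print(result)
```

The one-point tolerance is arbitrary. The design point is that the claim gets tested against a recomputed figure before any prose is written around it.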

Third, it needs a reasoning layer. Retrieval alone does not explain margin compression, return on invested capital, provisioning buffers, or valuation spread. The system must connect numbers to business drivers and sector structure. Otherwise it is just a well-indexed filing cabinet.

Fourth, it needs a response policy. The assistant should know when to answer, when to provide ranges, when to disclose uncertainty, and when to refuse. A refusal policy copied from general safety training is not enough. Finance needs task-specific epistemic discipline: verify, calculate, caveat, or stop.
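"Verify, calculate, caveat, or stop" can be written down as an explicit decision rule. The sketch below is one hypothetical ordering of those checks. The `Action` names, the four boolean inputs, and their priority are assumptions for illustration, not a policy described in the paper.

```python
from enum import Enum


class Action(Enum):
    ANSWER = "answer"
    ANSWER_WITH_RANGE = "answer_with_range"
    ANSWER_WITH_CAVEAT = "answer_with_caveat"
    REFUSE = "refuse"


def choose_action(grounded: bool, reconciled: bool, current: bool, exact: bool) -> Action:
    """Pick a response mode from four checks, most severe failure first."""
    if not grounded:
        return Action.REFUSE              # no source data: stop
    if not reconciled:
        return Action.REFUSE              # numbers failed their own checks: stop
    if not current:
        return Action.ANSWER_WITH_CAVEAT  # disclose the date of the data
    if not exact:
        return Action.ANSWER_WITH_RANGE   # give a range, not a point estimate
    return Action.ANSWER


print(choose_action(grounded=True, reconciled=True, current=False, exact=True))
print(choose_action(grounded=True, reconciled=True, current=True, exact=False))
```

Whether a failed reconciliation should mean refusal or a caveated answer is a judgment call for each desk. The point of writing the rule down is that the call gets made once, deliberately, rather than by the model on each prompt.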

Here is the product implication in one table:

AFIB finding	Direct paper evidence	Cognaptus interpretation	Boundary
Financial intelligence is multi-dimensional	Models differ across accuracy, recency, depth, completeness, consistency, and hallucination resistance	Procurement should evaluate workflow fit, not generic model rank	Benchmark scope is Indian equities and structured questions
Retrieval improves recency	Perplexity leads the recency benchmark	Live data access is necessary for market workflows	Retrieval does not guarantee correct synthesis
Reasoning-oriented models still matter	Gemini performs strongly on analytical depth	Financial interpretation needs structured reasoning, not just search	Weak recency can make good reasoning stale
Refusal reduces hallucination but may reduce usefulness	Claude has strong hallucination resistance but lower coverage	Governance should distinguish safe refusal from verified correctness	Low hallucination alone is an incomplete metric
Hybrid systems look strongest in this benchmark	SuperInvesting ranks first overall and in four of five modules	Data pipelines plus domain reasoning are likely the better architecture	Vendor-specific results need independent replication

This is not a marketing slogan for “AI in finance.” It is almost the opposite. The benchmark suggests that financial AI becomes useful only when it is less magical and more engineered.

Where the result should not be overread

The paper gives a useful benchmark, but its boundaries matter.

First, the market context is Indian equities. That is a rich domain, but it is not the same as U.S. equities, Chinese A-shares, fixed income, private credit, derivatives, or crypto markets. Each domain has different data structures, disclosure practices, update cycles, and reasoning requirements.

Second, the benchmark uses structured questions rather than full end-to-end research workflows. Institutional research often involves document collection, spreadsheet modeling, scenario analysis, peer comparison, committee discussion, and revision. AFIB captures important components of that workflow, not the whole machine.

Third, the model comparison is time-sensitive. The paper itself frames the evaluation as a snapshot from FY2025–26. That framing is exactly right. Financial AI rankings will move as models gain better retrieval, tool use, numerical verification, and domain adaptation.

Fourth, SuperInvesting’s strong performance should be interpreted as evidence within this benchmark, not as a universal product endorsement. The result is still useful, but serious buyers should replicate the evaluation on their own asset classes, documents, languages, and decision processes. Trust, but make the benchmark run on your own messy data.

The benchmark is really about workflow risk

The best way to read AFIB is not as a contest among model brands. It is a map of workflow risk.

A financial AI system can fail by being stale, shallow, numerically wrong, inconsistent, incomplete, or too cautious to be useful. Different architectures reduce different risks. Retrieval reduces staleness. Domain data reduces factual drift. Reasoning models improve synthesis. Refusal policies reduce dangerous fabrication. Verification layers catch arithmetic and aggregation errors.

The future financial AI stack will likely combine all of them. Not because hybrid architecture sounds fashionable, but because finance is an unforgiving integration problem. The answer must be current, correct, interpretable, repeatable, and aware of uncertainty. Missing one of those properties can break the workflow.

That is the real lesson from AFIB. The question is not whether an LLM can “talk finance.” Many can. The question is whether the system can sustain financial discipline when the answer requires fresh data, exact numbers, structured reasoning, and the humility to stop before fiction becomes analysis.

Show me the money, yes.

But first, show me the reconciliation.

Cognaptus: Automate the Present, Incubate the Future.

¹ Akshay Gulati et al., “Evaluating Financial Intelligence in Large Language Models: Benchmarking SuperInvesting AI with LLM Engines,” arXiv:2603.08704, 2026, https://arxiv.org/abs/2603.08704.
