TL;DR for operators

Financial foundation models are not one product category. They are three partly overlapping tool families, and confusing them is how firms end up buying a chatbot and expecting a risk engine.

The paper reviewed here offers a useful taxonomy of financial foundation models across language, time-series, and visual-language systems, covering architectures, training methods, datasets, applications, and deployment challenges through June 2025 [1]. Its practical value is not that it declares a winner. It does something more useful: it shows which parts of financial AI are mature enough for workflow adoption, which are still research-shaped, and where the real bottlenecks sit.

For near-term deployment, financial language foundation models are the most mature. They are credible candidates for financial document question answering, earnings-call summarisation, disclosure review, compliance support, entity extraction, multilingual financial understanding, and retrieval-augmented advisory interfaces. This does not mean they are magically trustworthy. It means their data formats, training pathways, and benchmarks are less chaotic than the alternatives.

Financial time-series foundation models are more interesting than comfortable. They target forecasting, value-at-risk, volatility, order-flow simulation, and market-regime analysis. But finance is temporally hostile: leakage, non-stationarity, regime shifts, and backtest fragility make ordinary AI evaluation look almost charmingly innocent. In this category, the key operational question is not “Can the model predict?” but “Was the model allowed to know the future?”

Financial visual-language foundation models are the youngest category. They aim to understand charts, tables, filings, scanned reports, candlestick diagrams, and policy graphics together with text. That matters because finance is not written only in paragraphs. It is also written in tables that hide assumptions, charts that compress narratives, and regulatory documents that make sleep look attractive. The constraint is scale: current financial multimodal datasets are useful for evaluation, but often too small for serious large-scale pre-training.

The business interpretation is simple. Use the taxonomy as a deployment map. Put financial language foundation models (FinLFMs) closest to production, use financial time-series foundation models (FinTSFMs) carefully in research and risk-support settings, and treat financial visual-language foundation models (FinVLFMs) as an emerging layer for document intelligence. Do not confuse any of this with permission to let a model trade autonomously while everyone else goes for coffee. That is not innovation. That is negligence with a nicer dashboard.

The mistake is treating financial AI as one chatbot

A familiar boardroom fantasy goes like this: take a large language model, fine-tune it on filings, connect it to market data, and suddenly the firm owns a tireless junior analyst, risk officer, quant researcher, and compliance associate. Preferably all for less than the cost of one Bloomberg terminal and a slightly disappointing sandwich budget.

The survey punctures that fantasy without needing to shout. Financial AI is not a single “LLM for finance” problem. It is a family of modelling problems divided by modality, time, evidence, and institutional risk.

Financial text is one thing. Market microstructure is another. A central-bank dot plot is another. A balance-sheet table embedded in a PDF is another. A client transaction history governed by privacy rules is definitely another. Finance is full of representations that look superficially compatible because they can all be turned into tokens, but operationally they behave very differently.

That is why the paper’s main contribution is a taxonomy rather than another performance leaderboard. It separates financial foundation models into three groups:

| Category | Core input | Typical use | Current maturity | Main operational trap |
| --- | --- | --- | --- | --- |
| Financial language foundation models, or FinLFMs | Text: reports, news, filings, contracts, transcripts | QA, summarisation, compliance checks, entity extraction, advisory support | Highest | Mistaking fluency for factual grounding |
| Financial time-series foundation models, or FinTSFMs | Prices, returns, order flow, macro indicators, multivariate sequences | Forecasting, VaR, volatility, market simulation, ranking | Early but active | Lookahead bias and unstable regimes |
| Financial visual-language foundation models, or FinVLFMs | Charts, tables, scanned reports, visual financial artefacts plus text | Chart QA, table reasoning, document parsing, multimodal reporting | Nascent | Tiny benchmarks and weak visual-financial alignment |

This categorisation matters because the adoption path is different for each. A model that reads a 10-K well is not automatically good at forecasting volatility. A model that forecasts returns is not automatically good at parsing a balance-sheet table. A model that can describe a chart is not necessarily capable of doing numerical reasoning over the chart without hallucinating a very confident lie. Finance, as usual, insists on being inconvenient.

What the paper actually does: a map, not a model race

The paper is a survey, not a new benchmark paper. That matters for interpretation. It does not introduce a new model and prove that it beats a rival across a controlled test suite. Its evidence is instead a structured synthesis of models, datasets, training strategies, and applications.

The authors review the development of financial foundation models from the broader foundation-model era beginning around 2018 through June 2025. They compare this survey against nine prior surveys and argue that their coverage is broader because it includes the three major subfields—FinLFMs, FinTSFMs, and FinVLFMs—alongside datasets, applications, and challenges.

That makes the paper useful for strategic planning, but not as a procurement shortcut. A survey can tell you where the field is moving. It cannot tell you that a particular vendor model will survive your audit committee, your latency budget, your regulator, and your data leakage tests. Some miracles remain outside the literature review format.

The paper’s internal evidence has different roles:

| Paper component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Taxonomy and history figures | Conceptual framing | The field is splitting by modality and task type | That one category is universally superior |
| Training tables for FinLFMs, FinTSFMs, and FinVLFMs | Comparative inventory and implementation detail | The maturity gap between model classes | That listed models are production-ready |
| Dataset tables | Resource landscape | Dataset availability differs sharply by modality | That benchmark coverage reflects real financial workflows |
| Application table | Literature synthesis and comparison with prior work | Current applications cluster around extraction, prediction, decision support, and simulation | That autonomous financial agents are safe |
| Challenge section | Boundary analysis | Adoption blockers are data, algorithms, and infrastructure | That these blockers are solved |

This distinction is not pedantry. It is the difference between reading the paper as a roadmap and reading it as a shopping list. The former is useful. The latter is how organisations accidentally build a compliance chatbot that can quote policy documents and invent policy exceptions in the same breath.

FinLFMs are closest to production because finance still runs on documents

Financial language foundation models are the most mature part of the field because the modern financial system produces a majestic quantity of text. Annual reports, analyst notes, earnings calls, regulatory filings, contracts, policy documents, risk disclosures, meeting transcripts, audit comments, news, and client communications all beg to be searched, summarised, classified, translated, compared, and occasionally rescued from their own prose.

The paper reviews 21 representative FinLFMs and divides their evolution into three broad groups: BERT-style models, GPT-style models, and reasoning-enhanced models. The sequence is intuitive. Early FinBERT-style systems were mainly discriminative: sentiment analysis, classification, named entity recognition, and other structured NLP tasks. GPT-style systems brought instruction following and generation. More recent reasoning-enhanced systems add chain-of-thought-style training and reinforcement-learning alignment for complex financial reasoning.

The training pipeline is also relatively legible. FinLFMs increasingly follow the familiar pattern of pre-training or continued pre-training, supervised fine-tuning, and alignment. The survey’s training table includes examples ranging from BloombergGPT, trained at large scale on a mixture of general and financial tokens, to open-source or institution-specific models using LLaMA, Qwen, InternLM, BLOOM, Mistral-style, or other backbones. The exact architecture is less important than the pattern: general capability is inherited from broad foundation models; financial competence is added through domain corpora, instruction datasets, retrieval, and alignment.

For operators, the key insight is that FinLFM value is mostly workflow value before it is prediction value. The strongest near-term use cases are not “beat the market by reading the news faster.” They are duller and more valuable:

| Workflow | Why FinLFMs fit | Required control |
| --- | --- | --- |
| Financial document QA | Text-heavy, retrieval-friendly, auditable | Source-grounded answers with citations |
| Report summarisation | Repetitive analyst and compliance workload | Section-level traceability and human review |
| Risk disclosure analysis | Pattern recognition across filings and policies | Controlled taxonomy and escalation rules |
| Contract and covenant extraction | Semi-structured legal-financial language | Precision testing and exception handling |
| Multilingual financial support | Cross-language regulatory and investor materials | Jurisdiction-specific terminology checks |
| Internal knowledge assistants | Large stores of policies, memos, and reports | Retrieval boundaries and access controls |

The misconception to avoid is that a finance-specific language model is trustworthy because it has read finance. Reading finance is not the same as respecting finance. The paper’s alignment discussion is important here: finance-specific systems increasingly need factual accuracy, regulatory consistency, and reasoning transparency. That is not optional garnish. It is the difference between a useful assistant and a liability generator with a pleasant tone.

The business implication is therefore not “replace analysts.” It is “compress the document layer of financial work while preserving auditability.” In a serious deployment, the model should behave less like a charismatic intern and more like a controlled interface to verified financial knowledge. Less poetry. More provenance.
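To make "more provenance" concrete, here is a minimal sketch of a grounding check that could sit between a FinLFM and its reader. It is not something the paper specifies. The bracketed citation format, the `check_grounding` function, and the source ID `10K-2024` are all assumptions made for illustration, and the sentence splitter is deliberately naive.

```python
import re
from dataclasses import dataclass, field

# Assumed citation format: a source ID in square brackets, e.g. [10K-2024].
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


@dataclass
class GroundingReport:
    uncited_sentences: list = field(default_factory=list)
    unknown_sources: set = field(default_factory=set)

    @property
    def passed(self) -> bool:
        return not self.uncited_sentences and not self.unknown_sources


def check_grounding(answer: str, approved_sources: set) -> GroundingReport:
    """Flag sentences with no citation and citations outside the approved set."""
    report = GroundingReport()
    # Naive sentence split: ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited = set(CITATION.findall(sentence))
        if not cited:
            report.uncited_sentences.append(sentence)
        report.unknown_sources |= cited - approved_sources
    return report


report = check_grounding(
    "Revenue rose 4% year on year [10K-2024]. Margins will recover next quarter.",
    approved_sources={"10K-2024"},
)
print(report.passed, report.uncited_sentences)
```

A check like this only proves that a citation is present and approved. It does not prove that the cited passage supports the claim, which is where section-level traceability and human review still earn their keep.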

FinTSFMs are where prediction meets temporal hygiene

Financial time-series foundation models are a different animal. They are designed for sequential financial data: prices, returns, order flow, exchange rates, volatility, macro indicators, and related multivariate histories. The paper identifies seven representative FinTSFM efforts and separates them into models trained directly on time-series data and models adapted from language models.

This is where the taxonomy earns its keep. A language model can discuss a market regime. A time-series model must survive one.

The paper describes models trained from scratch on time-series data, such as MarketGPT for order-flow sequences, and adaptations of broader time-series models such as TimesFM into financial settings. It also covers approaches that adapt language-model architectures to temporal data, including techniques that reprogram time-series segments into prompt-like representations or fine-tune GPT-style models on multivariate sequences. The field is energetic, but it is not yet settled. Unlike FinLFMs, there is no widely standardised training pipeline equivalent to pre-training, supervised fine-tuning, and alignment.

The dataset picture explains why. Financial time-series datasets are harder to make useful than they look. A price table is not automatically a good benchmark. Realistic evaluation needs transaction costs, regime diversity, non-stationarity, temporal splits, asset universes, survivorship control, and clarity over what information was available when. Otherwise the model may be learning finance in the same way a student “learns” the answer key.

The paper highlights datasets such as long-run S&P 500 index data, exchange-rate series, Bitcoin price histories, FNSPID with news and stock prices, and FinTSB, a benchmark designed around market patterns such as uptrends, downtrends, volatility, and black-swan settings. That progression is meaningful: the field is moving from simple historical series toward more realistic, structured benchmark conditions.

For business use, FinTSFMs should be treated as decision-support infrastructure, not autonomous alpha machines. Their plausible applications include risk forecasting, scenario generation, volatility modelling, left-tail value-at-risk support, market-regime detection, and signal enrichment. They may also be useful inside simulation engines where generating plausible order-flow or price-path behaviour is more important than producing a single trade recommendation.

The boundary is severe: temporal leakage can invalidate the entire exercise. The paper explicitly discusses lookahead bias, where a model trained on broad corpora may have absorbed information published after the evaluation period. A model evaluated on 2020 data but trained on post-2020 material is not forecasting. It is time travel with extra steps.

For operators, the minimum serious checklist looks like this:

| Question | Why it matters |
| --- | --- |
| Was the training corpus time-bounded relative to the test period? | Prevents future information leakage |
| Are assets selected using only information available at the time? | Avoids survivorship and selection bias |
| Are transaction costs and liquidity constraints included? | Converts toy prediction into tradable relevance |
| Are results tested across regimes? | Prevents overfitting to calm markets |
| Are forecasts evaluated separately from portfolio decisions? | Distinguishes prediction skill from allocation luck |
| Can the model adapt without forgetting prior regimes? | Addresses non-stationarity and structural breaks |
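The first row of that checklist can be partly automated. The sketch below is an illustration under stated assumptions, not the paper's procedure: `Document`, `leaked_documents`, and `walk_forward_splits` are hypothetical names, and the split is a plain rolling window in which every training index precedes every test index.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    doc_id: str
    published: date


def leaked_documents(training_docs, test_start: date) -> list:
    """Return IDs of training documents published on or after the test period starts."""
    return [d.doc_id for d in training_docs if d.published >= test_start]


def walk_forward_splits(n_obs: int, train_size: int, test_size: int):
    """Yield (train, test) index ranges; training always ends before testing begins."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train_end = start + train_size
        yield range(start, train_end), range(train_end, train_end + test_size)
        start += test_size  # roll the window forward by one test block


corpus = [
    Document("filing-2019", date(2019, 3, 1)),
    Document("retrospective-2021", date(2021, 6, 30)),
]
print(leaked_documents(corpus, test_start=date(2020, 1, 1)))
```

This only covers documents you can date. It says nothing about what a pre-trained backbone absorbed before you received it, so the knowledge cutoff of the base model has to be checked separately.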

The glamorous phrase is “financial time-series foundation model.” The practical phrase is “temporally disciplined forecasting component.” The second one is less exciting. It is also more likely to survive contact with a risk committee.

FinVLFMs matter because financial knowledge is often visual

The third category, financial visual-language foundation models, is the easiest to underestimate. Many executives hear “visual-language” and think of chart captions. That is too small.

Financial communication is deeply visual. Earnings reports use tables. Central banks use dot plots and projection charts. Investor presentations use waterfall charts, segment graphics, and carefully optimistic axes. Trading systems use candlestick diagrams and order-book visualisations. Regulatory filings arrive as documents where structure, layout, table boundaries, footnotes, and numeric alignment matter.

FinVLFMs attempt to process these visual artefacts together with text. The paper frames them as models for visual question answering, document parsing, chart interpretation, and multimodal reasoning. It describes a common three-part architecture: a vision encoder, a vision projector, and a base language model.

That architecture is simple enough to be dangerous if misunderstood. The vision encoder converts images into embeddings. The projector aligns those embeddings with the language model’s token space. The base LLM performs the reasoning and generation. In ordinary images, that might be enough for many tasks. In finance, the difficulty is that a tiny visual detail may be the difference between a correct answer and a very expensive misunderstanding.

A candlestick wick, a table subtotal, a negative sign in parentheses, a footnote marker, a scale break, or a quarterly column heading is not decoration. It is data.
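The three-part architecture described above can be sketched as a data flow. This is a toy, with every component a stand-in: `encode_patches` replaces a learned vision encoder with two hand-made features, `project` is a fixed linear map rather than a trained projector, and the numbers are invented. Its only purpose is to show how visual features are mapped to the embedding size the base language model expects and then joined with the text.

```python
def encode_patches(patches):
    """Stand-in vision encoder: one 2-d feature (mean, range) per image patch."""
    return [[sum(p) / len(p), max(p) - min(p)] for p in patches]


def project(features, weights):
    """Vision projector: linear map from the vision feature size to the LLM embedding size."""
    d_out = len(weights[0])
    return [
        [sum(f[i] * weights[i][j] for i in range(len(f))) for j in range(d_out)]
        for f in features
    ]


def build_llm_input(visual_tokens, text_embeddings):
    """Prepend projected visual tokens to the text embeddings the base LLM would consume."""
    return visual_tokens + text_embeddings


# Three image patches of four pixel values each (illustrative numbers only).
patches = [[0.0, 1.0, 2.0, 3.0], [5.0, 5.0, 5.0, 5.0], [1.0, 0.0, 1.0, 0.0]]
projector_weights = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]  # maps 2 dims to 4
text_embeddings = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]  # two text tokens

visual_tokens = project(encode_patches(patches), projector_weights)
sequence = build_llm_input(visual_tokens, text_embeddings)
print(len(sequence), len(sequence[0]))
```

Note what the toy encoder throws away: each patch is reduced to two numbers. A real encoder keeps far more, but the same question applies. If the wick, the footnote marker, or the parenthesis does not survive encoding and projection, the language model never sees it.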

The paper reviews representative FinVLFMs such as FinVis-GPT, FinTral, and FinLLaVA. These systems generally use a two-stage training paradigm: modal alignment pre-training followed by supervised fine-tuning. Their training datasets include hundreds of thousands to over a million visual question-answer pairs in some cases, while many evaluation benchmarks remain much smaller.

The dataset constraint is the main story. The paper lists financial visual-language datasets covering credit tables, numerical question answering over text and tables, chart QA, multimodal finance evaluation, multi-hop VQA, and newer benchmarks such as FinMME. But many current datasets contain only hundreds to a few thousand question-answer pairs. That is useful for evaluation and early development. It is not enough to declare that the multimodal layer of finance has been solved.

For business, FinVLFMs are promising where documents and visuals collide:

| Use case | Operational value | Why current systems need caution |
| --- | --- | --- |
| Earnings-table QA | Faster extraction from reports | Multi-step numerical reasoning can fail silently |
| Chart interpretation | Converts visual trends into searchable analysis | Axis, scale, and resolution errors matter |
| Regulatory-document parsing | Supports compliance review over scanned or formatted documents | Layout and footnotes are hard |
| Investor-deck analysis | Links narrative claims to visual evidence | Marketing charts are often designed to persuade |
| Central-bank communication analysis | Combines statements, projections, and charts | Policy interpretation needs context beyond visuals |

The right near-term posture is controlled augmentation. Use FinVLFMs to extract, compare, retrieve, and flag. Use humans and deterministic checks for final numbers. The model can point to the right table and suggest the relationship. It should not be the only entity deciding whether the ratio breached a covenant. That is what spreadsheets, auditors, and professional anxiety are for.
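A deterministic check of this kind might look like the sketch below. Everything in it is assumed for illustration: the function names are hypothetical, the covenant is a simple debt-to-EBITDA ceiling, and the figures are invented. The point is only that the arithmetic is done by code that treats accounting parentheses as negatives, not by the model.

```python
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Parse a reported figure; accounting-style parentheses mean a negative value."""
    cleaned = text.strip().replace(",", "").replace("$", "")
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    value = Decimal(cleaned)
    return -value if negative else value


def reconcile_total(model_total: str, line_items: list,
                    tolerance: Decimal = Decimal("0.01")) -> bool:
    """True if the total reported by the model matches the recomputed sum of line items."""
    recomputed = sum((parse_amount(x) for x in line_items), Decimal("0"))
    return abs(recomputed - parse_amount(model_total)) <= tolerance


def leverage_breach(total_debt: str, ebitda: str, max_ratio: Decimal) -> bool:
    """True if debt / EBITDA exceeds the covenant ceiling (assumed covenant form)."""
    debt, earnings = parse_amount(total_debt), parse_amount(ebitda)
    if earnings <= 0:
        raise ValueError("EBITDA is not positive; route to human review")
    return debt / earnings > max_ratio


# A model that read "(1,234)" as positive would report 6,234 instead of 3,766.
print(reconcile_total("6,234", ["5,000", "(1,234)"]))
print(reconcile_total("3,766", ["5,000", "(1,234)"]))
```

`Decimal` is used instead of floats so that reported figures are compared exactly, and the non-positive EBITDA case raises rather than guessing, which keeps the awkward cases in front of a person.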

Applications cluster around extraction first, prediction second, autonomy last

The paper’s application section is especially useful because it prevents a common inversion. In public discussion, financial AI quickly becomes a story about trading agents. In the literature surveyed here, the more grounded applications begin with knowledge extraction.

The paper groups representative applications into four categories: financial knowledge extraction, market prediction, trading and financial decision-making, and agent-based financial simulation.

That ordering is not accidental. Knowledge extraction is the foundation. Before a model can support forecasting, advising, or simulation, it needs to transform unstructured text, tables, charts, and reports into usable representations. This is where domain-specific systems can be genuinely valuable. A model that reliably extracts entities, relationships, metrics, disclosures, and table values can improve many downstream workflows without pretending to be Warren Buffett in a GPU cluster.

Market prediction is more fragile. The survey includes studies using general-purpose models and emerging domain-specific models for tasks such as VaR forecasting, stock ranking from multi-source signals, sentiment-driven timing, and qualitative portfolio insight generation. The direction is plausible: foundation models can integrate heterogeneous information better than many narrow systems. But plausibility is not production evidence. Backtests must be read like legal contracts: slowly, suspiciously, and with attention to what is missing.

Trading and decision-making applications move closer to high-stakes deployment. The paper discusses systems involving retrieval-augmented financial assistants, memory-based trading agents, interpretable factor extraction, and GPT-assisted portfolio construction. These are valuable research directions, but they also intensify the need for controls. Once a model’s output affects allocation, advice, or risk treatment, explainability and compliance stop being nice-to-have features.

Agent-based financial simulation is perhaps the most conceptually interesting and practically immature category. Multi-agent systems can simulate investors, traders, market makers, and risk officers. They may help test hypotheses about market behaviour or stress-test decision rules. But simulation realism depends on whether agents behave like financial actors, not improv theatre participants wearing “trader” badges.

A useful deployment ladder emerges:

| Stage | Suitable model role | Business risk |
| --- | --- | --- |
| Extraction | Convert documents, tables, charts, and news into structured knowledge | Moderate, controllable with validation |
| Explanation | Summarise, compare, and justify financial information | Moderate, requires grounding |
| Forecast support | Add signals for risk, volatility, ranking, or scenarios | High, requires temporal integrity |
| Decision support | Assist allocation, advisory, compliance, or risk actions | Very high, requires audit and governance |
| Autonomous action | Execute or recommend trades with limited human control | Extreme, usually unjustified outside constrained research |

The sensible enterprise path starts at the first stage, extraction, and earns its way toward the later, riskier ones. Unfortunately, “we automated disclosure extraction” is less exciting than “our agents trade like a hedge fund.” It is also far less likely to become a regulatory incident.

The adoption map: match model class to workflow friction

The paper’s strongest business use is as a classification system for adoption decisions. It helps firms ask the right question before they ask the expensive one.

The wrong question is: “Which financial foundation model should we buy or build?”

The better question is: “Which financial representation is causing operational friction?”

If the bottleneck is text, start with FinLFMs. If the bottleneck is temporal forecasting or market simulation, evaluate FinTSFMs under strict leakage controls. If the bottleneck is charts, tables, scanned filings, or document layout, explore FinVLFMs. If the workflow includes all three, do not pretend one model class will solve everything cleanly. Build a pipeline.

| Operational friction | Better-fit model class | Practical architecture |
| --- | --- | --- |
| Analysts spend hours reading reports | FinLFM | Retrieval-augmented report QA with source citations |
| Compliance teams review repetitive disclosures | FinLFM | Policy-grounded classifier plus escalation workflow |
| Risk teams need scenario and tail-risk support | FinTSFM | Time-bounded forecasting model with regime tests |
| Quants need signal enrichment from news and prices | FinLFM + FinTSFM | Separate text extraction from temporal modelling |
| Teams manually extract values from PDF tables | FinVLFM | Layout-aware parsing with deterministic reconciliation |
| Investor decks contain charts and narrative claims | FinVLFM + FinLFM | Visual extraction plus textual consistency checks |
| Multi-agent market experiments are being explored | FinLFM + FinTSFM, possibly FinVLFM | Simulation sandbox with no direct production authority |
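The adoption map can also be written down as a small routing rule, so that "build a pipeline" becomes something a team can review line by line. This is only a sketch: the modality labels and the `plan_pipeline` function are hypothetical, and the control attached to each model class is borrowed from the wording of this section, not from a standard.

```python
MODEL_CLASS_BY_MODALITY = {
    "text": "FinLFM",
    "time_series": "FinTSFM",
    "visual": "FinVLFM",
}

# One headline control per model class, taken from the discussion above.
REQUIRED_CONTROL = {
    "FinLFM": "source citations",
    "FinTSFM": "strict leakage controls",
    "FinVLFM": "deterministic reconciliation",
}


def plan_pipeline(modalities) -> list:
    """Map the modalities causing friction to (model class, required control) stages."""
    unknown = set(modalities) - set(MODEL_CLASS_BY_MODALITY)
    if unknown:
        raise ValueError(f"no model class mapped for: {sorted(unknown)}")
    return [
        (model_class, REQUIRED_CONTROL[model_class])
        for modality, model_class in MODEL_CLASS_BY_MODALITY.items()
        if modality in modalities
    ]


# An investor deck with charts and narrative claims needs two stages, not one model.
for stage in plan_pipeline({"visual", "text"}):
    print(stage)
```

The design choice worth copying is the failure mode: an unmapped modality raises an error instead of being quietly handed to whichever model happens to be available.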

This is where the “foundation model” label can mislead. Foundation does not mean universal. It means reusable representation. In finance, reusability is constrained by modality, time, privacy, regulation, and latency. A foundation model can be broad and still be wrong for the job.

The real moat is not model size; it is governed financial context

One of the more useful signals in the paper is the repeated emphasis on data. Not just more data. Better financial data, aligned to the right modality and governed properly.

FinLFMs benefit from comparatively mature text datasets and instruction benchmarks. Financial text resources have evolved from small English-centric datasets such as sentiment and QA benchmarks toward multi-task, multilingual, cross-lingual, and more realistic datasets. That explains why this category feels more deployable.

FinTSFMs and FinVLFMs are less comfortable. Time-series resources often suffer from limited scope, short windows, inconsistent assumptions, and unrealistic evaluation. Visual-language datasets are diverse but often small, making them more suitable for evaluation than for large-scale pre-training. The paper is quite clear that data scarcity is not a side issue. It is a central bottleneck.

For financial institutions, this implies a strategic point: the durable advantage may not come from owning the largest model. It may come from owning the cleanest governed context.

That context includes:

  • proprietary but permissioned internal documents;
  • time-stamped data with strict availability boundaries;
  • validated financial knowledge graphs;
  • versioned policy and regulatory libraries;
  • labelled exceptions and compliance decisions;
  • historical forecasts with decision-time metadata;
  • multimodal document stores with table, chart, and layout annotations.
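Two items on that list, time-stamped data and versioned policy libraries, reduce to the same mechanism: answer every query as of a date. Here is a minimal sketch, assuming a hypothetical in-memory `VersionedLibrary` in which each document version carries the date it took effect. A production store would also need access control and audit logging, which are out of scope here.

```python
from bisect import bisect_right
from datetime import date


class VersionedLibrary:
    """Keeps every version of a document and answers 'what applied on date X?'."""

    def __init__(self):
        self._versions = {}  # doc_id -> list of (effective_date, text), sorted by date

    def add(self, doc_id: str, effective: date, text: str) -> None:
        versions = self._versions.setdefault(doc_id, [])
        versions.append((effective, text))
        versions.sort(key=lambda v: v[0])

    def as_of(self, doc_id: str, when: date):
        """Return the latest version effective on or before `when`, or None."""
        versions = self._versions.get(doc_id, [])
        idx = bisect_right([eff for eff, _ in versions], when)
        return versions[idx - 1][1] if idx else None


library = VersionedLibrary()
library.add("margin-policy", date(2023, 1, 1), "Initial margin: 10%")
library.add("margin-policy", date(2024, 6, 1), "Initial margin: 12%")
print(library.as_of("margin-policy", date(2024, 1, 15)))
```

The same lookup serves two purposes: a retrieval system can avoid surfacing a superseded policy, and a backtest can avoid consulting a version that did not yet exist.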

This is less glamorous than announcing a 70-billion-parameter financial oracle. It is also where production advantage usually lives. Model weights are increasingly accessible. Clean, governed, temporally honest financial context is not.

Boundaries: the paper is a roadmap, not a trading licence

The paper’s challenge section identifies three adoption bottlenecks: data, algorithms, and computing infrastructure. Those categories are broad, but the five specific issues that follow from them are practical.

First, data scarcity is sharpest for multimodal and time-series financial models. Large-scale chart-text, table-report, and temporally disciplined market datasets are difficult to build because financial data is structured, regulated, proprietary, and often expensive. Synthetic data can help, but synthetic finance data must be treated carefully. A synthetic candlestick chart can teach pattern recognition. It cannot automatically teach market reality.

Second, privacy and regulation are not generic enterprise concerns. They define what can be trained, shared, logged, retrieved, and explained. The paper discusses approaches such as federated learning, differential privacy, and compliance-focused alignment. These are promising, but they are not magic erasers. A model trained in a privacy-preserving way can still generate a non-compliant recommendation if its downstream controls are weak.

Third, hallucination is more dangerous in finance than in casual search. A fabricated earnings figure, false regulatory event, or invented risk exposure can damage decisions quickly. Retrieval-augmented generation and knowledge graphs are important mitigations, especially when models must answer from controlled sources. But retrieval is not the same as truth. Retrieval can fetch the wrong document, miss the relevant footnote, or surface stale policy.

Fourth, lookahead bias is lethal for forecasting claims. The paper’s example is straightforward: if a model evaluated on 2020 financial data was trained on material published after 2020, the evaluation has violated temporal integrity. This is not a minor methodological quibble. It is the difference between forecasting and accidentally reading tomorrow’s newspaper.

Fifth, cost and latency matter. The paper notes that BloombergGPT required approximately 1.3 million GPU hours on NVIDIA A100s, with an estimated cost of around USD 1–2 million. That is training cost, not the full lifecycle cost of deployment, monitoring, inference, governance, security, and model updates. In real-time trading, customer interaction, and risk monitoring, large-model latency can be operationally unacceptable.
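As a sanity check on those figures, the arithmetic is short enough to write out. The GPU-hour count and the cost range come from the paragraph above. The cluster size of 1,000 GPUs is a hypothetical used only to convert GPU hours into calendar time; it is not a reported figure for any model.

```python
GPU_HOURS = 1_300_000                    # figure cited above
COST_RANGE_USD = (1_000_000, 2_000_000)  # estimated range cited above


def implied_hourly_rate(total_cost_usd: float, gpu_hours: float) -> float:
    """Cost per GPU-hour implied by a total training cost."""
    return total_cost_usd / gpu_hours


def training_days(gpu_hours: float, n_gpus: int) -> float:
    """Wall-clock days if the work were spread evenly over n_gpus."""
    return gpu_hours / n_gpus / 24


low, high = (implied_hourly_rate(cost, GPU_HOURS) for cost in COST_RANGE_USD)
print(f"Implied rate: ${low:.2f} to ${high:.2f} per GPU-hour")

HYPOTHETICAL_CLUSTER = 1_000  # assumption for illustration only
print(f"About {training_days(GPU_HOURS, HYPOTHETICAL_CLUSTER):.1f} days on "
      f"{HYPOTHETICAL_CLUSTER} GPUs")
```

The implied rate works out to roughly USD 0.77 to 1.54 per GPU-hour, which is a useful number to hold against whatever your own compute actually costs.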

This points toward modular architectures: large models for complex reasoning, smaller models for latency-sensitive tasks, retrieval systems for grounding, deterministic systems for calculation, and human review for high-impact decisions. The future of financial foundation models may look less like one giant brain and more like a governed committee. Annoying, but familiar to finance.

What Cognaptus infers for business use

The paper directly shows that the financial foundation model landscape is broadening across modalities and that maturity varies by category. It also shows that applications are already emerging across extraction, prediction, decision support, and simulation, but many remain exploratory and still rely on general-purpose foundation models.

Cognaptus infers three practical lessons.

First, deploy where verification is strongest. Document QA, summarisation, extraction, and compliance triage are attractive because outputs can be grounded in sources and checked against rules. These workflows can produce measurable productivity gains without requiring the model to predict markets.

Second, separate reasoning from execution. A model may generate useful investment rationales, identify factors, or summarise risk signals. That does not mean it should execute trades, approve advice, or override controls. The correct architecture keeps models inside governed decision pipelines, not outside them wearing sunglasses.

Third, evaluate by failure mode, not demo quality. In finance, the dangerous failures are often quiet: a missing date boundary, a stale filing, a chart misread, a hidden leakage path, a plausible but false statement, an answer that violates suitability rules. The evaluation suite should target these failures directly.

A practical readiness test might look like this:

| Readiness dimension | Minimum question |
| --- | --- |
| Grounding | Can every material claim be traced to an approved source? |
| Temporal integrity | Can the system prove what it knew at decision time? |
| Numerical reliability | Are calculations reconciled against deterministic tools? |
| Compliance alignment | Are prohibited outputs blocked and escalated? |
| Privacy | Are client and proprietary data protected by design? |
| Latency | Can the system meet workflow timing constraints? |
| Monitoring | Are drift, hallucination, and retrieval failures logged? |
| Human control | Is there a clear boundary between assistance and authority? |
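The readiness test can be enforced as a release gate instead of a slide. The sketch below assumes a hypothetical `evidence` dictionary in which each dimension is marked `True` only once someone has signed off on it. The dimension names mirror the table, and the rule that a missing entry counts as a failure is a design assumption, not something the paper prescribes.

```python
READINESS_DIMENSIONS = (
    "grounding",
    "temporal_integrity",
    "numerical_reliability",
    "compliance_alignment",
    "privacy",
    "latency",
    "monitoring",
    "human_control",
)


def blocking_dimensions(evidence: dict) -> list:
    """Dimensions without an explicit pass; a missing entry counts as a failure."""
    return [d for d in READINESS_DIMENSIONS if evidence.get(d) is not True]


def ready_for_production(evidence: dict) -> bool:
    return not blocking_dimensions(evidence)


evidence = {d: True for d in READINESS_DIMENSIONS}
evidence["temporal_integrity"] = False  # e.g. the training cutoff could not be verified
del evidence["monitoring"]              # never assessed

print(ready_for_production(evidence), blocking_dimensions(evidence))
```

Treating "not assessed" the same as "failed" is the part that matters. A gate that passes on missing evidence is a demo, not a control.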

This is the unglamorous work that turns foundation models into financial infrastructure. It is also the work most demos skip, because watching a model obey access controls is not exactly conference-stage material.

Finance does not need one oracle; it needs disciplined model portfolios

The paper’s deeper message is that financial AI is becoming multimodal and modular. Text models, time-series models, visual-language models, retrieval systems, knowledge graphs, privacy mechanisms, and smaller deployment models will increasingly be combined into financial engineering pipelines.

That is the right direction. Finance is not a single data format, and it is not a single decision problem. It is a stack of documents, prices, policies, tables, charts, behaviours, incentives, and constraints. Any AI system pretending otherwise is not simplifying the problem. It is deleting the parts that matter.

The near-term winners will not be the firms that ask foundation models to replace financial judgment wholesale. They will be the firms that use them to reduce information friction, improve audit trails, accelerate structured analysis, and make expert review more focused. FinLFMs can make document-heavy workflows faster. FinTSFMs can enrich risk and forecasting research when temporal hygiene is enforced. FinVLFMs can open the visual layer of financial documents, once the data and evaluation mature.

The frontier, then, is not just “trained on tickers.” It is tuned for trust: grounded, time-aware, privacy-preserving, multimodal, cost-conscious, and humble enough to know when a spreadsheet should do the arithmetic.

A rare thing in AI finance, admittedly. But one can dream responsibly.

Cognaptus: Automate the Present, Incubate the Future.


  1. Liyuan Chen, Shuoling Liu, Jiangpeng Yan, Xiaoyu Wang, Henglin Liu, Chuang Li, Kecheng Jiao, Jixuan Ying, Yang Veronica Liu, Qiang Yang, and Xiu Li, “Advancing Financial Engineering with Foundation Models: Progress, Applications, and Challenges,” arXiv:2507.18577, 2025.