Exams are useful because they are rude. They do not care that a model sounds polished, cites the right buzzwords, or can produce a gorgeous paragraph about duration risk. They ask for A, B, or C. Then they mark the answer wrong.
That is why a new CFA-based benchmark is more useful than another misty-eyed essay about AI “transforming finance.” The paper evaluates GPT-4o, GPT-o1, and o3-mini on 1,560 official CFA mock multiple-choice questions across Levels I, II, and III, both zero-shot and with a domain-reasoning RAG pipeline built from official CFA curriculum materials.[1] The result is not a single leaderboard. It is closer to a routing manual.
The headline is simple: GPT-o1 is strongest overall, o3-mini is surprisingly credible for cost-sensitive workflows, and GPT-4o is not the model you casually hand a math-heavy Level II case unless you enjoy operational theatre. More importantly, retrieval helps—but not everywhere, not equally, and definitely not because the acronym RAG has magical properties. Finance, inconveniently, still rewards knowing where the numbers came from and whether the arithmetic survived contact with reality.
The scoreboard is an operating model, not a trophy case
The paper’s core evidence is the overall accuracy table. It is the correct place to start because the business question is not “Can an LLM pass a finance quiz?” The useful question is: which model belongs in which workflow, under which augmentation strategy, at what level of risk?
| CFA level | GPT-4o zero-shot → RAG | GPT-o1 zero-shot → RAG | o3-mini zero-shot → RAG | Operational reading |
|---|---|---|---|---|
| Level I | 78.56% → 79.44% | 94.78% → 94.78% | 87.56% → 88.33% | Foundational questions are already mostly inside the models; retrieval adds little. |
| Level II | 59.55% → 60.45% | 89.32% → 91.36% | 79.77% → 84.32% | Case-based analysis and quantitative integration separate the adults from the interns. |
| Level III | 64.09% → 68.64% | 79.09% → 87.73% | 70.91% → 76.36% | Long portfolio cases are where curated retrieval starts earning its chair. |
The first pattern is model separation. GPT-o1 leads every level, and the gap is not cosmetic. On Level II, it beats GPT-4o by nearly 30 percentage points in zero-shot accuracy. That matters because Level II is where finance stops being vocabulary and becomes applied analysis: vignettes, tables, valuation logic, and enough embedded context to punish shallow recall.
The second pattern is that o3-mini is not merely a cheap toy. It trails GPT-o1, but it remains strong enough to matter: 87.56% at Level I, 79.77% at Level II, and 70.91% at Level III before retrieval. Under RAG, it rises to 84.32% on Level II and 76.36% on Level III. For high-volume financial education, document triage, junior analyst support, or first-pass internal Q&A, that is not “almost useless.” It is “probably worth routing carefully.”
The third pattern is the uncomfortable one: GPT-4o is uneven in this benchmark. Its Level I result is respectable. Its Level II result is weak. Its Level III aggregate improves with RAG, but topic-level weaknesses remain. This does not make GPT-4o a bad general model. It means “generalist multimodal flagship” is not the same thing as “trusted financial reasoning engine.” Finance departments, famously, should not confuse those categories. Yet here we are.
The benchmark is narrow enough to be useful
The dataset matters because it is neither a toy benchmark nor a full simulation of professional finance. The authors use official CFA mock exams: five 2024 Level I mocks, five 2025 Level II mocks, and five Level III mocks from 2022 to 2024. That produces 900 Level I questions, 440 Level II questions, and 220 Level III questions.
Level I questions are short: the average total length is only 146 characters. Level II and III are much longer, averaging 2,148 and 2,538 characters respectively. That escalation is not incidental. It is the benchmark’s pressure gradient. Level I checks whether the model recognises a concept. Level II and III increasingly test whether it can carry context, interpret tables, apply a rule, and avoid losing the plot halfway through the vignette.
The paper’s exam framing is therefore useful, but not absolute. CFA multiple-choice questions are a disciplined proxy for professional reasoning. They are not client suitability reviews, live portfolio construction, open-ended investment committee memos, proprietary credit models, or regulatory sign-off. The study tells us something important about financial reasoning under controlled conditions. It does not certify a model as a portfolio manager. The models may sit the CFA-style questions; they do not get a Bloomberg terminal, a client mandate, and permission to rebalance the pension fund. Sensible, really.
RAG works best when the problem is knowledge access, not arithmetic
The misconception to kill early is that RAG is a universal finance patch. It is not.
The paper’s RAG pipeline is stronger than the lazy version of “throw documents into a vector database and hope the answer becomes adult.” It converts official CFA curriculum PDFs into markdown using MinerU, stores embeddings with OpenAI’s text-embedding-3-small in Chroma, organises the knowledge base by CFA level and topic, asks the model to generate a short query summary plus 5–10 keywords, retrieves five relevant curriculum segments, and then asks the model to answer with the retrieved context.
That pipeline is not merely an implementation detail. It is part of the experimental condition. The authors are testing whether models can use authoritative curriculum context, not just whether they memorised finance concepts during pre-training.
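The order of operations in that pipeline can be sketched in a few lines. This is a minimal sketch, not the authors' code: the in-memory `KNOWLEDGE_BASE`, the `plan_query` keyword extractor, and the word-overlap ranking are hypothetical stand-ins for the paper's model-generated query, its text-embedding-3-small embeddings, and its Chroma store. The toy corpus text is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    level: str  # CFA level, e.g. "III"
    topic: str  # curriculum topic, e.g. "Fixed Income"
    text: str


# Hypothetical toy corpus; the paper indexes official CFA curriculum text.
KNOWLEDGE_BASE = [
    Segment("III", "Fixed Income", "Duration measures the sensitivity of a bond price to yield changes."),
    Segment("III", "Fixed Income", "Convexity adjusts the duration estimate for large yield changes."),
    Segment("III", "Equity", "Dividend discount models value a stock from expected dividends."),
    Segment("I", "Fixed Income", "A bond coupon is the periodic interest payment."),
]

STOPWORDS = {"the", "a", "an", "of", "to", "for", "is", "and", "in", "what", "how", "does", "if"}


def plan_query(question: str, max_keywords: int = 10) -> list[str]:
    """Stand-in for the model step that writes a summary plus 5-10 keywords."""
    keywords: list[str] = []
    for raw in question.split():
        word = raw.strip(".,?!").lower()
        if word and word not in STOPWORDS and word not in keywords:
            keywords.append(word)
    return keywords[:max_keywords]


def retrieve(keywords: list[str], level: str, topic: str, k: int = 5) -> list[Segment]:
    """Scope by level and topic first, then rank by keyword overlap (assumed scoring)."""
    wanted = set(keywords)

    def score(segment: Segment) -> int:
        words = {w.strip(".,").lower() for w in segment.text.split()}
        return len(words & wanted)

    scoped = [s for s in KNOWLEDGE_BASE if s.level == level and s.topic == topic]
    ranked = sorted(scoped, key=score, reverse=True)
    return [s for s in ranked if score(s) > 0][:k]


def build_prompt(question: str, segments: list[Segment]) -> str:
    """Assemble the answer prompt from the retrieved context."""
    context = "\n".join(f"- {s.text}" for s in segments)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer with A, B, or C."


question = "How does duration change the bond price if yield rises?"
hits = retrieve(plan_query(question), level="III", topic="Fixed Income")
prompt = build_prompt(question, hits)
```

The point of the sketch is the sequencing: scoping happens before ranking, and the query that reaches the retriever is a cleaned-up derivative of the question rather than the raw vignette.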
The result is selective rather than universal improvement. At Level I, RAG barely moves the needle. GPT-o1 does not improve at all; GPT-4o and o3-mini gain less than one percentage point. That is unsurprising. Level I is short and foundational. If the model already knows the formula or rule, extra context is mostly luggage.
At Level III, RAG becomes material. GPT-o1 improves by 8.64 percentage points. o3-mini improves by 5.45. GPT-4o improves by 4.55. This is the important business clue: retrieval pays when the task requires specialised, contextual, multi-step interpretation. It is less useful when the task is basic recall or when the failure mode is computation.
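The gains quoted here are plain subtraction on the overall accuracy table, and the arithmetic is easy to reproduce. The sketch below copies the table's figures into a dictionary; the names `ACCURACY` and `rag_gain` are illustrative, not from the paper.

```python
# Overall accuracy in percent, copied from the table: (zero-shot, RAG).
ACCURACY = {
    "GPT-4o": {"I": (78.56, 79.44), "II": (59.55, 60.45), "III": (64.09, 68.64)},
    "GPT-o1": {"I": (94.78, 94.78), "II": (89.32, 91.36), "III": (79.09, 87.73)},
    "o3-mini": {"I": (87.56, 88.33), "II": (79.77, 84.32), "III": (70.91, 76.36)},
}


def rag_gain(model: str, level: str) -> float:
    """Percentage-point change from zero-shot to RAG for one model and level."""
    zero_shot, rag = ACCURACY[model][level]
    return round(rag - zero_shot, 2)


for model in ACCURACY:
    gains = {level: rag_gain(model, level) for level in ("I", "II", "III")}
    print(model, gains)
```

Every model's largest gain lands at Level III, and every Level I gain stays below one percentage point, which is the selective pattern the text describes.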
Topic-level results add the necessary mess. RAG helps Fixed Income and Portfolio Management substantially in Level III, with Portfolio Management especially notable for GPT-o1, rising from 77.50% to 92.50%. But retrieval also hurts in places. Level III Equity Investments declines under RAG for all three models: GPT-4o falls from 68.75% to 62.50%, GPT-o1 from 75.00% to 68.75%, and o3-mini from 75.00% to 62.50%.
That is the part product teams should underline. Retrieval can inject the wrong frame, over-specific context, or distracting material. A model can be confidently misled by its own carefully curated appendix. The solution is not “no RAG.” The solution is better retrieval governance: topic routing, reranking, source deduplication, snippet quality checks, and evaluation by failure mode rather than vibes. Vibes, as usual, are not an audit trail.
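Part of that governance is mechanical enough to sketch. The example below is an assumption-laden illustration, not anything the paper specifies: the `Candidate` fields, the `min_score` threshold of 0.5, and the whitespace-and-case fingerprint used for deduplication are all hypothetical choices. It shows topic routing, a relevance floor, and deduplication applied before anything reaches the prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    source_id: str
    topic: str
    text: str
    score: float  # retriever relevance score, higher is better (assumed scale 0-1)


def govern(candidates, allowed_topics, min_score=0.5, max_snippets=5):
    """Keep on-topic, sufficiently relevant, non-duplicate snippets, best first."""
    kept = []
    seen = set()
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        if cand.topic not in allowed_topics:
            continue  # topic routing: drop snippets from the wrong part of the corpus
        if cand.score < min_score:
            continue  # relevance floor: weak matches are more likely to distract
        fingerprint = " ".join(cand.text.lower().split())
        if fingerprint in seen:
            continue  # deduplication: the same passage should not be packed twice
        seen.add(fingerprint)
        kept.append(cand)
        if len(kept) == max_snippets:
            break
    return kept


candidates = [
    Candidate("a", "Fixed Income", "Duration measures price sensitivity.", 0.9),
    Candidate("b", "Fixed Income", "duration  measures price sensitivity.", 0.8),
    Candidate("c", "Equity", "Dividend discount model.", 0.95),
    Candidate("d", "Fixed Income", "Coupons are paid periodically.", 0.2),
    Candidate("e", "Fixed Income", "Convexity refines the duration estimate.", 0.7),
]
selected = govern(candidates, allowed_topics={"Fixed Income"})
```

Each `continue` is also a natural logging point: counting why snippets were dropped is a cheap first step toward evaluating retrieval by failure mode.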
The figures diagnose the bottleneck rather than adding a second thesis
The paper’s visual material has a clear hierarchy. Figure 1 is an implementation diagram: it explains the domain-reasoning RAG pipeline. It is useful for reproducing the method, but it is not independent evidence that RAG works.
Figures 2 and 3 examine readability. Their likely purpose is diagnostic, almost a sensitivity check: perhaps models fail because CFA questions are written in dense prose. The authors compute Flesch Reading Ease scores and compare the distributions for correctly and incorrectly answered questions. The finding is blunt: readability does not meaningfully explain accuracy differences. Level II and III questions consistently score as moderately difficult to read, but simpler prose does not reliably make the models more correct.
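For readers who have not met the metric, Flesch Reading Ease is a fixed formula over sentence length and syllable density, where higher scores mean easier text. The sketch below implements the standard formula from pre-counted totals; how the authors tokenised words, sentences, and syllables is not stated here, so the counts in the example are invented.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease score from raw counts (higher is easier)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical passage: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(words=100, sentences=5, syllables=150)
print(round(score, 3))  # 59.635
```

Because the score depends only on surface counts, it can flag dense prose but says nothing about whether a question needs a formula the model does not know.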
That matters operationally because it redirects effort. If models are failing CFA-style finance questions, rewriting the question into plainer English is not the main fix. The bottleneck is elsewhere: missing domain facts, weak retrieval, bad formula selection, numerical execution, or answer-choice inconsistency.
Appendix figures on RAG improvements by topic appear to support the topic-level interpretation. They are best treated as exploratory visual extensions of the tables, not a second argument. The main evidence remains the accuracy tables and the error taxonomy.
The failure analysis says “knowledge first, calculators second”
The most valuable part of the paper is not that GPT-o1 wins. That was plausible before the experiment. The valuable part is the error diagnosis, because error diagnosis tells a business what to build next.
The authors classify model failures into four categories: knowledge errors, reasoning errors, calculation errors, and inconsistency errors. In the RAG condition, knowledge errors dominate. The paper reports that 61.68% of mistakes across models are knowledge errors.
That finding is easy to misunderstand. It does not mean “just add more documents.” It means the model often lacks, misremembers, or misapplies the specific curriculum fact, rule, formula, ethical standard, or relationship among concepts needed for the answer. In production terms, that points to knowledge-base coverage, source freshness, retrieval precision, and context integration—not merely model size.
The model-specific patterns are also instructive. GPT-4o has the messiest error profile. It produces substantial inconsistency errors across all levels and has more calculation trouble, especially at Level I. GPT-o1’s errors are mostly knowledge errors at higher levels, with virtually no inconsistency errors at Levels II and III. o3-mini shows a large share of reasoning errors at Level I, but at higher levels its failures are also mainly knowledge-related.
The implication is uncomfortable but useful: upgrading the model and improving retrieval solve different problems. A stronger model may select answers more coherently and compute more reliably. A better corpus may reduce missing-knowledge failures. Neither automatically substitutes for the other.
Calculation remains a separate control surface
The question-type analysis separates conceptual questions from calculation questions. The dataset is mostly conceptual: 1,186 conceptual items versus 374 calculation items across all levels. But calculation questions punch above their weight when models fail, especially for GPT-4o.
Zero-shot, GPT-4o answers only 167 of 374 calculation questions correctly, or 44.7%. It performs much better on conceptual items, answering 943 of 1,186 correctly, or 79.5%. RAG improves its conceptual performance but does not rescue its calculations; in fact, the RAG calculation count falls to 156 correct out of 374.
GPT-o1 and o3-mini are much stronger numerically. GPT-o1 answers more than 92% of calculation items correctly in both zero-shot and RAG settings. o3-mini is also strong, moving from 331 correct calculation answers zero-shot to 342 under RAG. Still, the broader lesson holds: retrieval supplies concepts, definitions, rules, and relevant context. It does not guarantee arithmetic integrity.
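The percentages above follow directly from the reported counts, and the same arithmetic fills in the rates the text leaves as raw counts. The dictionary layout and function name below are illustrative; the counts are the ones quoted in this section.

```python
# Question counts by type across all three levels.
TOTALS = {"calculation": 374, "conceptual": 1186}

# Correct-answer counts quoted in the text: (model, condition, question type).
CORRECT = {
    ("GPT-4o", "zero-shot", "calculation"): 167,
    ("GPT-4o", "zero-shot", "conceptual"): 943,
    ("GPT-4o", "rag", "calculation"): 156,
    ("o3-mini", "zero-shot", "calculation"): 331,
    ("o3-mini", "rag", "calculation"): 342,
}


def accuracy_pct(model: str, condition: str, question_type: str) -> float:
    """Accuracy in percent, rounded to one decimal place."""
    correct = CORRECT[(model, condition, question_type)]
    return round(100 * correct / TOTALS[question_type], 1)


for key in CORRECT:
    print(key, accuracy_pct(*key))
```

Run as written, this gives 44.7% and 79.5% for GPT-4o zero-shot, 41.7% for GPT-4o calculation under RAG, and 88.5% rising to 91.4% for o3-mini calculation.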
For a financial institution, this is the easiest operational lesson in the paper. Do not ask the language model to be the calculator of record. Let it identify the formula, explain assumptions, and draft the narrative. Route time value of money, duration, convexity, option payoffs, IRR, DCF, and scenario calculations through deterministic tools. Then ask the model to interpret the verified output.
The division of labour is not glamorous. Good. Glamour is how spreadsheets acquire hidden circular references.
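That division of labour can be sketched with nothing but the standard library. The example is a sketch under stated assumptions: the request format (`{"tool": ..., "args": ...}`), the `TOOLS` registry, and the bisection bounds are hypothetical, and a production system would cover far more formulas than NPV and IRR. The model's job is to emit the structured request; the deterministic code does the arithmetic.

```python
def npv(rate: float, cash_flows: list[float]) -> float:
    """Net present value, with cash_flows[0] occurring at time zero."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def irr(cash_flows: list[float], low: float = -0.99, high: float = 10.0,
        tol: float = 1e-9, max_iter: int = 200) -> float:
    """Internal rate of return by bisection on NPV (assumes one sign change in range)."""
    f_low = npv(low, cash_flows)
    f_high = npv(high, cash_flows)
    if f_low * f_high > 0:
        raise ValueError("NPV does not change sign between the bounds")
    for _ in range(max_iter):
        mid = (low + high) / 2
        f_mid = npv(mid, cash_flows)
        if abs(f_mid) < tol or (high - low) / 2 < tol:
            return mid
        if f_low * f_mid < 0:
            high = mid  # root lies in the lower half
        else:
            low, f_low = mid, f_mid  # root lies in the upper half
    return (low + high) / 2


# Hypothetical registry: the only functions a model-issued request may call.
TOOLS = {"npv": npv, "irr": irr}


def run_tool(request: dict) -> float:
    """Execute a structured request such as a model might emit."""
    name = request["tool"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**request["args"])


# The model proposes the formula and inputs; the tool produces the number.
result = run_tool({"tool": "irr", "args": {"cash_flows": [-100.0, 110.0]}})
print(round(result, 6))  # 0.1
```

The registry is the control surface: a request for a formula that has not been implemented and verified fails loudly instead of being improvised by the model.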
The practical architecture is a tiered finance stack
The paper’s results point to a tiered deployment strategy. This is Cognaptus’s inference from the evidence, not something the experiment directly validates in live enterprise systems.
| Workflow type | Recommended pattern | Why the paper supports it | Boundary |
|---|---|---|---|
| High-stakes analysis | GPT-o1 + curated RAG + calculation tools + human review | GPT-o1 leads across all CFA levels and uses RAG well on complex Level III tasks. | CFA MCQs are not full regulatory or client-advisory workflows. |
| High-volume support | o3-mini + selective topic-scoped RAG | o3-mini is cost-efficient under the paper’s March 2025 pricing assumptions and performs strongly on routine and moderately complex tasks. | Cheaper inference does not remove the need for monitoring and escalation. |
| General multimodal work | GPT-4o for non-core reasoning tasks; route finance calculations elsewhere | GPT-4o is uneven on Level II and weak on calculation-heavy items. | It may still be valuable where multimodal input matters more than CFA-style reasoning. |
| Quantitative outputs | Any model + deterministic verification | RAG improves conceptual accuracy more than calculation accuracy. | Verification design must cover the actual formulas used in production. |
| Knowledge-intensive policy or curriculum Q&A | RAG-first design with source discipline | Knowledge errors dominate residual failures. | Retrieval can mislead when snippets are irrelevant, over-specific, or poorly routed. |
The cleanest implementation lesson is that finance AI should look less like a chatbot and more like a model portfolio. Different assets, different risk exposures, different rebalancing rules. Use the expensive reasoning model where mistakes are material. Use the efficient model where volume matters and consequences are limited. Use retrieval where authoritative context changes the answer. Use calculators where numbers decide the answer. Use humans where accountability is not a decorative compliance sticker.
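The tiers in the table can be written down as a small routing function. This is an illustrative sketch of the inference above, not a validated policy: the `Task` flags, the `Route` fields, and the precedence order (stakes first, then modality, then volume) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    high_stakes: bool
    needs_calculation: bool
    knowledge_intensive: bool
    multimodal: bool = False


@dataclass(frozen=True)
class Route:
    model: str
    use_rag: bool
    use_calculator: bool
    human_review: bool


def route(task: Task) -> Route:
    """Map a task to a model and its supporting controls (hypothetical policy)."""
    if task.high_stakes:
        # Top tier: strongest model, curated retrieval, tools, and a human.
        return Route("GPT-o1", use_rag=True, use_calculator=True, human_review=True)
    if task.multimodal:
        # Generalist tier: calculations still go to deterministic tools.
        return Route("GPT-4o", use_rag=task.knowledge_intensive,
                     use_calculator=task.needs_calculation, human_review=False)
    # Volume tier: efficient model, retrieval only where knowledge is the bottleneck.
    return Route("o3-mini", use_rag=task.knowledge_intensive,
                 use_calculator=task.needs_calculation, human_review=False)


decision = route(Task(high_stakes=False, needs_calculation=True, knowledge_intensive=True))
print(decision)
```

In every branch the calculator flag follows the task rather than the model, which matches the table's last-but-one row: quantitative outputs get deterministic verification regardless of which model drafted them.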
RAG should be treated as a data product, not prompt garnish
The paper’s RAG design contains several production lessons worth stealing.
First, the knowledge base is organised by level and topic. In enterprise finance, that maps naturally to product, jurisdiction, asset class, policy type, and document authority. Retrieval should not search “the finance folder.” That is not architecture. That is rummaging.
Second, the query is generated before retrieval. The model summarises what is being asked and extracts domain keywords. This matters because raw user questions often contain noise, narrative context, or misleading terms. A retrieval planner can convert a messy analyst query into a cleaner search target.
Third, the system retrieves a small number of segments. Five references is not a universal optimum, but the design principle is sound: context packing should be disciplined. More context can mean more confusion, particularly when the retrieved passages compete.
Fourth, the evaluation is granular. The paper reports performance by level, topic, model, RAG condition, error type, readability, and question type. That is the minimum spirit required for production monitoring. “The assistant is helpful” is not a metric. “The assistant fails Level III Equity-style questions when retrieval introduces conflicting context” is a metric with a repair path.
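Granular evaluation is mostly a grouping exercise. The sketch below computes accuracy for any combination of slice keys from per-question records; the record fields and the four sample rows are invented for illustration and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_slice(records, keys):
    """Accuracy per slice, where a slice is the tuple of values for `keys`."""
    tallies = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    for record in records:
        slice_key = tuple(record[k] for k in keys)
        tallies[slice_key][0] += int(record["correct"])
        tallies[slice_key][1] += 1
    return {key: correct / total for key, (correct, total) in tallies.items()}


# Hypothetical per-question log, one row per graded answer.
records = [
    {"model": "A", "level": "III", "topic": "Equity", "condition": "rag", "correct": True},
    {"model": "A", "level": "III", "topic": "Equity", "condition": "rag", "correct": False},
    {"model": "A", "level": "III", "topic": "Equity", "condition": "zero-shot", "correct": False},
    {"model": "B", "level": "III", "topic": "Equity", "condition": "rag", "correct": True},
]

by_model_condition = accuracy_by_slice(records, ["model", "condition"])
print(by_model_condition)
```

Changing the `keys` list to include `topic` or an error-type field turns the same log into the kind of slice-level metric that comes with a repair path.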
What the paper directly shows, and what it does not
Directly, the paper shows that three OpenAI models perform very differently on official CFA mock MCQs; that GPT-o1 is consistently strongest; that o3-mini offers a strong cost-performance profile under the paper’s stated price assumptions; that RAG helps most in complex, knowledge-intensive settings; that residual failures are dominated by knowledge errors; and that readability does not explain the accuracy differences.
It also shows something more subtle: context is not automatically helpful. The Level III Equity Investments degradation is a useful warning because it resembles a common enterprise failure. A system retrieves something plausible, the model dutifully uses it, and the answer becomes worse. This is how “grounded AI” becomes grounded in the wrong floor.
What remains uncertain is equally important. The benchmark uses multiple-choice questions, not open-ended memo writing. It excludes the essay portion of Level III. It uses official curriculum materials, not messy internal SharePoint archaeology. It covers three models at a particular moment in the model cycle. Pricing assumptions were stated as of March 2025 and can change. It does not test live market data, proprietary positions, client constraints, or legal liability.
These boundaries do not weaken the paper. They make it usable. A controlled benchmark is not a production pilot. It is a map of where the rocks probably are.
The business value is routing, not replacement
The lazy takeaway would be: “GPT-o1 is best, so use GPT-o1.” That is technically true and operationally undercooked.
The better takeaway is that financial reasoning systems need routing logic. A single model is rarely the right unit of design. The system should decide whether a task is conceptual, computational, policy-bound, client-facing, exploratory, or high-stakes. It should then choose the right combination of model, retrieval, tool use, review, and logging.
For CFOs, risk teams, asset managers, and fintech product leaders, this changes the investment question. The question is not whether to buy the most capable model everywhere. The question is where marginal accuracy is worth marginal cost, and where architecture can do more than brute-force inference.
On this evidence, the answer is clear enough:
Use GPT-o1 where reasoning quality matters and the cost of error is high. Use o3-mini where throughput and cost discipline matter. Use RAG where specific authoritative knowledge is the bottleneck. Use deterministic tools where arithmetic decides the result. Use human review where a wrong answer becomes a real obligation.
That is not the cinematic version of AI in finance. It is the version that might actually survive procurement, audit, and the first serious incident review. Progress, in other words.
Cognaptus: Automate the Present, Incubate the Future.
[1] Xuan Yao, Qianteng Wang, Xinbo Liu, and Ke-Wei Huang, “Evaluating Large Language Models for Financial Reasoning: A CFA-Based Benchmark Study,” arXiv:2509.04468, 2025, https://arxiv.org/abs/2509.04468.