If your firm is debating whether to trust an LLM on investment memos, this study is a gift: 1,560 questions from official CFA mock exams across Levels I–III, run on three model archetypes—multimodal generalist (GPT‑4o), deep-reasoning specialist (GPT‑o1), and lightweight cost‑saver (o3‑mini)—both zero‑shot and with a domain‑reasoning RAG pipeline. Below is what matters for adoption, not just leaderboard bragging rights.
What the paper really shows
- Reasoning beats modality for finance. The reasoning‑optimized model (GPT‑o1) dominates across levels; the generalist (GPT‑4o) is inconsistent, especially on math‑heavy Level II.
- RAG helps where context is long and specialized. Gains are largest at Level III (portfolio cases) and in Fixed Income/Portfolio Management, modest at Level I. Retrieval cannot fix arithmetic.
- Most errors are knowledge gaps, not reading problems. Readability barely moves accuracy; the bottleneck is surfacing the right curriculum facts and applying them.
- Cost–accuracy has a sweet spot. o3‑mini + targeted RAG is strong enough for high‑volume workflows; o1 should be reserved for regulated, high‑stakes analysis.
Executive snapshot
| CFA Level | GPT‑4o (ZS → RAG) | GPT‑o1 (ZS → RAG) | o3‑mini (ZS → RAG) | Takeaway |
|---|---|---|---|---|
| I | 78.6% → 79.4% | 94.8% → 94.8% | 87.6% → 88.3% | Foundations already in‑model; RAG adds little |
| II | 59.6% → 60.5% | 89.3% → 91.4% | 79.8% → 84.3% | Level II exposes math + integration gaps; RAG helps smaller models most |
| III | 64.1% → 68.6% | 79.1% → 87.7% | 70.9% → 76.4% | Case‑heavy; RAG is decisive, especially for o1 |
ZS = zero‑shot. Accuracies are from the paper’s aggregated results.
Where models stumble (and why it matters operationally)
- Knowledge errors dominate (≈ two‑thirds of mistakes). For production: invest in domain‑curated corpora and precision retrieval before you scale model size.
- Calculation errors vary by architecture. GPT‑4o misses far more numeric items, while o1 and o3‑mini score roughly 90% or better on calculation‑typed questions. Add numeric verification layers (a deterministic calculator or tool‑calling) for cash flows, durations, and option payoffs.
- Topic effects are uneven. Ethics is a steady win for newer models, while Equity at Level III can degrade under RAG if retrieval is noisy; use reranking plus deduplicated snippets, and show source attributions in analyst UIs.
Buy‑side deployment map
When to use o1
- Regulatory memos, investment committee writeups, IPS construction—anything where a wrong paragraph is material. Pair with RAG + calculation tools.
When to use o3‑mini
- Screening, doc triage, first‑pass Q&A, training content, and “explain this table” tasks. Attach narrow RAG (topic‑scoped) to avoid drift.
What to avoid with GPT‑4o (for now)
- Math‑dense Level II–style tasks without calculators/tools. Use it for multimodal workflows, but route quant steps through a calculator agent.
A minimal architecture that actually works
- Curriculum‑mirrored indexing. Separate vector stores by Level × Topic to reduce retrieval noise.
- Two‑step query planning. Ask the model to (a) summarize the needed concept and (b) emit 5–10 domain keywords. Use those for retrieval, not the raw question (see the planner sketch after this list).
- Context discipline. Cap context to ~5 snippets; dedupe formulas and tables; label each snippet with topic → section → page; show the labels in the analyst UI (packing sketch below).
- Math guardrails. Route NPV, duration/convexity, and option Greeks to a deterministic calculator; have the LLM explain but not compute (calculator sketch below).
- QA harness. Log misses by error type (knowledge / reasoning / calculation / inconsistency) to guide corpus curation and prompt tweaks (logging sketch below).
Implementation checklist (copy‑paste to your backlog)
- Level×Topic vector shards and strict routing
- Query‑planner prompt with summary + keywords
- Reranker (hybrid BM25 + dense) before context packing (sketch after this checklist)
- Calculator tool for TVM, IRR, DCF, duration, Black‑Scholes
- Answer schema: choice + 80‑word rationale + cited snippets (schema sketch below)
- Offline eval harness with topic‑weighted pass criteria (weighting sketch below)
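For the hybrid reranker, one common recipe blends BM25 lexical scores with dense cosine similarity before packing context. The sketch below assumes the `rank_bm25` package and precomputed embeddings from any encoder; the blend weight `alpha` is an assumption to tune offline.

```python
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def hybrid_rerank(query: str, docs: list[str],
                  doc_vecs: np.ndarray, query_vec: np.ndarray,
                  alpha: float = 0.5, top_k: int = 5) -> list[str]:
    """Blend lexical BM25 with dense cosine similarity; return the
    top_k documents to pack as context."""
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    lex = bm25.get_scores(query.lower().split())
    lex = (lex - lex.min()) / (lex.max() - lex.min() + 1e-9)  # scale to [0, 1]
    dense = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    blended = alpha * lex + (1 - alpha) * dense
    order = np.argsort(blended)[::-1][:top_k]
    return [docs[i] for i in order]
```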
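The answer schema can be enforced as a plain dataclass; the field names and the three-option choice set are assumptions matching CFA-style multiple choice:

```python
from dataclasses import dataclass, field

@dataclass
class AnalystAnswer:
    choice: str                                   # "A", "B", or "C"
    rationale: str                                # target: <= 80 words
    cited_snippets: list[str] = field(default_factory=list)

    def validate(self) -> None:
        if self.choice not in {"A", "B", "C"}:
            raise ValueError(f"unexpected choice: {self.choice!r}")
        if len(self.rationale.split()) > 80:
            raise ValueError("rationale exceeds 80 words")
        if not self.cited_snippets:
            raise ValueError("answer must cite at least one snippet")
```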
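Finally, a sketch of topic-weighted pass criteria, so that gating on Fixed Income or Portfolio Management counts more than raw average accuracy; the weights are yours to set:

```python
def topic_weighted_accuracy(results: dict, weights: dict) -> float:
    """results maps topic -> (correct, total); weights maps topic ->
    importance. Gate releases on this, not on raw accuracy."""
    num = sum(weights[t] * c / n for t, (c, n) in results.items())
    den = sum(weights[t] for t in results)
    return num / den

# Example: weight a high-stakes topic twice as heavily.
print(topic_weighted_accuracy(
    {"Ethics": (46, 50), "Fixed Income": (38, 50)},
    {"Ethics": 1.0, "Fixed Income": 2.0},
))  # ~0.813
```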
What this means for Cognaptus clients
- Don’t overpay by default. Start with o3‑mini + RAG + calc tools for 80% of analyst assistance. Escalate to o1 for committee‑facing output.
- RAG is not optional for finance. But treat it like data product engineering, not prompt magic: structured sources, versioning, citations.
- Human‑in‑the‑loop stays. Use the model to explain, cite, and prepare drafts; keep sign‑off with licensed professionals.
Cognaptus: Automate the Present, Incubate the Future