If your firm is debating whether to trust an LLM on investment memos, this study is a gift: 1,560 questions from official CFA mock exams across Levels I–III, run on three model archetypes—multimodal generalist (GPT‑4o), deep-reasoning specialist (GPT‑o1), and lightweight cost‑saver (o3‑mini)—both zero‑shot and with a domain‑reasoning RAG pipeline. Below is what matters for adoption, not just leaderboard bragging rights.

What the paper really shows

  • Reasoning beats modality for finance. The reasoning‑optimized model (GPT‑o1) dominates across levels; the generalist (GPT‑4o) is inconsistent, especially on math‑heavy Level II.
  • RAG helps where context is long and specialized. Gains are largest at Level III (portfolio cases) and in Fixed Income/Portfolio Management, modest at Level I. Retrieval cannot fix arithmetic.
  • Most errors are knowledge gaps, not reading problems. Readability barely moves accuracy; the bottleneck is surfacing the right curriculum facts and applying them.
  • Cost–accuracy has a sweet spot. o3‑mini + targeted RAG is strong enough for high‑volume workflows; o1 should be reserved for regulated, high‑stakes analysis.

Executive snapshot

| CFA Level | GPT‑4o (ZS → RAG) | GPT‑o1 (ZS → RAG) | o3‑mini (ZS → RAG) | Takeaway |
|---|---|---|---|---|
| I | 78.6% → 79.4% | 94.8% → 94.8% | 87.6% → 88.3% | Foundations already in‑model; RAG adds little |
| II | 59.6% → 60.5% | 89.3% → 91.4% | 79.8% → 84.3% | Level II exposes math + integration gaps; RAG helps smaller models most |
| III | 64.1% → 68.6% | 79.1% → 87.7% | 70.9% → 76.4% | Case‑heavy; RAG is decisive, especially for o1 |

ZS = zero‑shot. Accuracies are from the paper’s aggregated results.

Where models stumble (and why it matters operationally)

  • Knowledge errors dominate (≈ two‑thirds of mistakes). For production: invest in domain‑curated corpora and precision retrieval before you scale model size.
  • Calculation errors vary by architecture. GPT‑4o misses far more numeric items; o1 and o3‑mini score roughly 90%+ on calculation‑typed questions. Add numeric verification layers (deterministic calc or tool‑calling) for cash flows, durations, and option payoffs.
  • Ethics vs. Economics vs. Equity. Ethics is a steady win for newer models; Equity at Level III can degrade under RAG if retrieval is noisy—use reranking, deduplicated snippets, and source attributions in analyst UIs.
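A "numeric verification layer" can be as simple as a dispatch table of deterministic finance functions the model calls as tools. A minimal sketch (function names and the tool-dispatch shape are illustrative, not the paper's implementation):

```python
def npv(rate: float, cash_flows: list[float]) -> float:
    """Discount a cash-flow series (t = 0, 1, 2, ...) at a flat rate."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

def macaulay_duration(ytm: float, coupon_rate: float, face: float, years: int) -> float:
    """Macaulay duration of an annual-pay bond, in years."""
    cfs = [coupon_rate * face] * years
    cfs[-1] += face  # final coupon plus principal
    pvs = [cf / (1 + ytm) ** t for t, cf in enumerate(cfs, start=1)]
    price = sum(pvs)
    return sum(t * pv for t, pv in enumerate(pvs, start=1)) / price

# The LLM explains; the calculator computes. Routing every numeric step
# through this table keeps the model from doing arithmetic itself.
TOOLS = {"npv": npv, "macaulay_duration": macaulay_duration}

def run_tool(name: str, **kwargs) -> float:
    return TOOLS[name](**kwargs)
```

The same pattern extends to convexity and option payoffs; the point is that every number in the final answer traces back to a deterministic call, not to token prediction.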

Buy‑side deployment map

When to use o1

  • Regulatory memos, investment committee writeups, IPS construction—anything where a wrong paragraph is material. Pair with RAG + calculation tools.

When to use o3‑mini

  • Screening, doc triage, first‑pass Q&A, training content, and “explain this table” tasks. Attach narrow RAG (topic‑scoped) to avoid drift.

What to avoid with 4o (for now)

  • Math‑dense Level II–style tasks without calculators/tools. Use it for multimodal workflows, but route quant steps through a calculator agent.

A minimal architecture that actually works

  1. Curriculum‑mirrored indexing. Separate vector stores by Level × Topic to reduce retrieval noise.

  2. Two‑step query planning. Ask the model to (a) summarize the needed concept and (b) emit 5–10 domain keywords. Use those for retrieval, not the raw question.

  3. Context discipline. Cap to ~5 snippets; dedupe formulas/tables; label each snippet with topic → section → page; show them in the analyst UI.

  4. Math guardrails. Route NPV, duration/convexity, option Greeks to a deterministic calculator; have the LLM explain but not compute.

  5. QA harness. Log misses by error type (knowledge / reasoning / calculation / inconsistency) to guide corpus curation and prompt tweaks.
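Steps 2 and 3 can be sketched end to end. The planner prompt and `Snippet` fields below are illustrative; the shaping logic—keyword-driven retrieval instead of the raw question, a hard snippet cap, dedupe, and provenance labels—is the part that matters:

```python
from dataclasses import dataclass

PLANNER_PROMPT = (
    "Before answering, (a) summarize in one sentence the concept this "
    "question tests, and (b) list 5-10 CFA-curriculum keywords. "
    'Return JSON: {"summary": "...", "keywords": ["..."]}'
)  # retrieval then runs on the keywords, not the raw question

@dataclass
class Snippet:
    text: str
    topic: str
    section: str
    page: int

MAX_SNIPPETS = 5  # context discipline: cap what gets packed

def pack_context(snippets: list[Snippet]) -> str:
    """Dedupe by text, cap at MAX_SNIPPETS, label each with provenance."""
    seen, packed = set(), []
    for s in snippets:
        key = s.text.strip()
        if key in seen:
            continue  # drop duplicated formulas/tables
        seen.add(key)
        packed.append(f"[{s.topic} → {s.section} → p.{s.page}] {s.text}")
        if len(packed) == MAX_SNIPPETS:
            break
    return "\n".join(packed)
```

The provenance labels double as the citations surfaced in the analyst UI, so context packing and attribution come from one code path.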

Implementation checklist (copy‑paste to your backlog)

  • Level×Topic vector shards and strict routing
  • Query‑planner prompt with summary + keywords
  • Reranker (hybrid BM25 + dense) before context packing
  • Calculator tool for TVM, IRR, DCF, duration, Black‑Scholes
  • Answer schema: choice + 80‑word rationale + cited snippets
  • Offline eval harness with topic‑weighted pass criteria
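The answer schema in the checklist is worth enforcing in code, not just in the prompt. A minimal validator, assuming multiple-choice output with A–D choices (field names and the word-count cap are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class AnalystAnswer:
    choice: str                    # e.g. "B"
    rationale: str                 # target ~80 words
    cited_snippets: list[str] = field(default_factory=list)  # provenance labels

    def validate(self) -> list[str]:
        """Return a list of schema violations; an empty list means valid."""
        problems = []
        if self.choice not in {"A", "B", "C", "D"}:
            problems.append(f"choice must be A-D, got {self.choice!r}")
        if len(self.rationale.split()) > 100:  # soft cap around the 80-word target
            problems.append("rationale exceeds word budget")
        if not self.cited_snippets:
            problems.append("answer must cite at least one snippet")
        return problems
```

Rejecting uncited answers at the schema layer is cheap insurance: it forces the RAG pipeline, not the model's memory, to carry the factual load.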

What this means for Cognaptus clients

  • Don’t overpay by default. Start with o3‑mini + RAG + calc tools for 80% of analyst assistance. Escalate to o1 for committee‑facing output.
  • RAG is not optional for finance. But treat it like data product engineering, not prompt magic: structured sources, versioning, citations.
  • Human‑in‑the‑loop stays. Use the model to explain and cite, and to pre‑compute drafts. Keep sign‑off with licensed professionals.

Cognaptus: Automate the Present, Incubate the Future