GPT-O1

If your firm is debating whether to trust an LLM on investment memos, this study is a gift: 1,560 questions from official CFA mock exams across Levels I–III, run on three model archetypes—multimodal generalist (GPT‑4o), deep-reasoning specialist (GPT‑o1), and lightweight cost‑saver (o3‑mini)—both zero‑shot and with a domain‑reasoning RAG pipeline. Below is what matters for adoption, not just leaderboard bragging rights. What the paper really shows Reasoning beats modality for finance. The reasoning‑optimized model (GPT‑o1) dominates across levels; the generalist (GPT‑4o) is inconsistent, especially on math‑heavy Level II. RAG helps where context is long and specialized. Gains are largest at Level III (portfolio cases) and in Fixed Income/Portfolio Management, modest at Level I. Retrieval cannot fix arithmetic. Most errors are knowledge gaps, not reading problems. Readability barely moves accuracy; the bottleneck is surfacing the right curriculum facts and applying them. Cost–accuracy has a sweet spot. o3‑mini + targeted RAG is strong enough for high‑volume workflows; o1 should be reserved for regulated, high‑stakes analysis. Executive snapshot CFA Level GPT‑4o (ZS → RAG) GPT‑o1 (ZS → RAG) o3‑mini (ZS → RAG) Takeaway I 78.6% → 79.4% 94.8% → 94.8% 87.6% → 88.3% Foundations already in‑model; RAG adds little II 59.6% → 60.5% 89.3% → 91.4% 79.8% → 84.3% Level II exposes math + integration gaps; RAG helps smaller models most III 64.1% → 68.6% 79.1% → 87.7% 70.9% → 76.4% Case‑heavy; RAG is decisive, especially for o1 ZS = zero‑shot. Accuracies are from the paper’s aggregated results. ...