In finance, the devil isn’t just in the details; it’s in the narrative. That is what makes this new study by Md Talha Mohsin both timely and essential: it directly evaluates how five top-tier LLMs (GPT-4, Claude 4 Opus, Perplexity, Gemini, and DeepSeek) perform at interpreting the most linguistically dense and strategically revealing part of corporate disclosures, the Business section of 10-K filings from the “Magnificent Seven” tech giants.
Rather than focusing on raw numbers or sentiment snippets, the study asks: can these LLMs extract strategic intent, infer risk, and assess future outlooks the way human analysts do?
## The Setup: Simulating the Analyst’s Mind
The experiment is grounded in realism. The author curated 21 “Item 1. Business” sections (7 firms × 3 years) from SEC 10-K filings and crafted 10 nuanced, open-ended questions per document. These weren’t trivia or summary tasks; they required interpretation, synthesis, and domain-specific reasoning.
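The paper doesn’t publish its extraction pipeline, but carving Item 1 out of a filing is straightforward in principle. Below is a minimal sketch, assuming the 10-K has already been fetched and converted to plain text; the helper name and the regex boundaries are my assumptions, not the author’s code.

```python
import re

def extract_item1(filing_text: str) -> str:
    """Slice the 'Item 1. Business' section from a plain-text 10-K.

    Hypothetical helper: matches everything between the 'Item 1'
    heading and the 'Item 1A. Risk Factors' heading. Filings also
    repeat these headings in the table of contents, so we keep the
    longest match rather than the first.
    """
    pattern = re.compile(
        r"item\s*1\s*[.\-:]?\s*business(.*?)item\s*1a\s*[.\-:]?\s*risk\s*factors",
        re.IGNORECASE | re.DOTALL,
    )
    sections = pattern.findall(filing_text)
    return max(sections, key=len).strip() if sections else ""
```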
Sample prompt:
“What are the company’s indicated strategic goals for the next two to three years? Explain why you choose your response.”
To minimize memory artifacts, all models were prompted in isolated sessions. Evaluations followed three tracks:
| Track | Description |
|---|---|
| Human Ratings | Five raters scored relevance, completeness, clarity, conciseness, and accuracy |
| Metric Benchmarks | ROUGE, cosine similarity (via SBERT), Jaccard |
| Behavioral Consistency | Prompt variance and inter-model agreement |
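The paper reports these metrics without accompanying code. Here is a minimal sketch of the three automatic benchmarks, assuming the `rouge-score` and `sentence-transformers` packages; the ROUGE-L variant and the `all-MiniLM-L6-v2` SBERT checkpoint are my assumptions, since the paper only names ROUGE, SBERT cosine similarity, and Jaccard.

```python
# pip install rouge-score sentence-transformers
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

ROUGE = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
SBERT = SentenceTransformer("all-MiniLM-L6-v2")  # checkpoint is an assumption

def score_answer(candidate: str, reference: str) -> dict:
    """Score one model answer against a reference on all three tracks."""
    # Track 1: ROUGE-L F1 (longest-common-subsequence overlap).
    rouge_l = ROUGE.score(reference, candidate)["rougeL"].fmeasure

    # Track 2: cosine similarity between SBERT sentence embeddings.
    embs = SBERT.encode([candidate, reference], convert_to_tensor=True)
    cosine = util.cos_sim(embs[0], embs[1]).item()

    # Track 3: Jaccard similarity over lowercase token sets.
    c, r = set(candidate.lower().split()), set(reference.lower().split())
    jaccard = len(c & r) / len(c | r) if c | r else 0.0

    return {"rougeL": rouge_l, "cosine": cosine, "jaccard": jaccard}
```

Note how the tracks can disagree: an answer that parrots the filing’s wording can post a high ROUGE-L while its embedding drifts from the reference, which is exactly the pattern the study reports for Gemini.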
## Findings: GPT Wins, but the Story Is Deeper
The overall winner in human evaluation was GPT-4, scoring highest across nearly all qualitative metrics. But as always, the nuance matters:
| Model | Strengths | Weaknesses |
|---|---|---|
| GPT-4 | Balanced; high relevance; strong factual grounding | Slight verbosity; less replicable lexical overlap |
| Claude 4 | Strong semantic alignment; top factual accuracy | Slightly less complete and concise than GPT-4 |
| Perplexity | Reliable across tasks; semantically strong | Mid-tier factuality; occasional vague phrasing |
| Gemini | High ROUGE; excels at lexical match and phrasing | Overly verbose; lower coherence in high-stakes contexts |
| DeepSeek | Most concise output | Weakest in relevance and semantic alignment |
Gemini scored best on ROUGE and Jaccard, but its verbose, shallow answers consistently underperformed in human judgments, underscoring that lexical overlap ≠ comprehension.
## Model Behavior: When Agreement Disappears
The most alarming finding isn’t who wins; it’s where the models disagree. While GPT and Claude often converged semantically (cosine similarity > 0.8), Gemini and DeepSeek frequently diverged, especially on newer disclosures such as Amazon’s 2024 filing.
A notable behavior: GPT served as the “semantic median”. Its outputs showed high similarity not only to the ground truth but also to the other models’ responses, suggesting it anchors common sense and relevance across LLMs.
*Figure: cosine similarity across models on Meta’s 2024 filing*
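The “semantic median” role is easy to compute once each model has answered the same question. A minimal sketch, reusing the same hypothetical SBERT checkpoint as above:

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

SBERT = SentenceTransformer("all-MiniLM-L6-v2")  # checkpoint is an assumption

def semantic_median(answers: dict[str, str]) -> str:
    """Return the model whose answer is, on average, most similar to
    every other model's answer: the 'semantic median' of the group."""
    names = list(answers)
    embs = SBERT.encode([answers[n] for n in names], convert_to_tensor=True)
    sims = util.cos_sim(embs, embs).cpu().numpy()  # pairwise cosine matrix
    np.fill_diagonal(sims, np.nan)                 # drop self-similarity
    return names[int(np.nanargmax(np.nanmean(sims, axis=1)))]
```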
## Implications for Financial Professionals
This paper doesn’t just benchmark models; it provides a litmus test for deploying FinNLP in real-world contexts.
Takeaways:
- Prompt sensitivity is real and under-discussed. Small shifts in wording yield different outputs, especially from Gemini and DeepSeek; see the measurement sketch after this list.
- Lexical similarity is a poor proxy for interpretive quality. Gemini’s high ROUGE didn’t translate into actionable insights.
- Hybrid usage has potential. GPT for base reasoning, Gemini for verbose extraction, Claude for fact-checking: this trinity could power explainable FinAI systems.
- Domain matters. GPT’s dominance on Microsoft filings versus its lower scores on Amazon’s 2024 filing hints at varying training exposure and disclosure complexity.
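One way to quantify the prompt sensitivity flagged in the first takeaway: pose several paraphrases of the same question, embed the answers, and measure how far they drift from one another. The `ask_model` callable below is hypothetical, standing in for whichever LLM API you wrap.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

SBERT = SentenceTransformer("all-MiniLM-L6-v2")  # checkpoint is an assumption

def prompt_sensitivity(ask_model, paraphrases: list[str]) -> float:
    """Mean pairwise cosine distance between answers to reworded
    prompts; higher values mean the model is more prompt-sensitive."""
    answers = [ask_model(p) for p in paraphrases]
    embs = SBERT.encode(answers, convert_to_tensor=True)
    sims = util.cos_sim(embs, embs).cpu().numpy()
    n = len(paraphrases)
    off_diag = sims[~np.eye(n, dtype=bool)]  # exclude self-similarity
    return float(1.0 - off_diag.mean())
```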
## Final Thoughts: Choose Your Analyst Wisely
In finance, misreading the narrative can cost millions. This study moves us toward a more scientific understanding of how LLMs read between the lines. For developers and firms alike, the message is clear: evaluating LLMs isn’t just about benchmark scores; it is about behavioral predictability, interpretive alignment, and strategic coherence.
GPT may currently lead the pack, but no model is infallible. What we need next is not just a better model but a better understanding of how models think, reason, and fail.
Cognaptus: Automate the Present, Incubate the Future.