In finance, the devil isn’t just in the details; it’s in the narrative. That is what makes this new study by Md Talha Mohsin both timely and essential: it directly evaluates how five top-tier LLMs (GPT-4, Claude 4 Opus, Perplexity, Gemini, and DeepSeek) perform at interpreting the most linguistically dense and strategically revealing part of corporate disclosures, the Business section of 10-K filings from the “Magnificent Seven” tech giants.
Rather than focusing on raw numbers or sentiment snippets, the study asks: can these LLMs extract strategic intent, infer risk, and assess future outlooks the way human analysts do?
## The Setup: Simulating the Analyst’s Mind
The experiment is grounded in realism. The author curated 21 “Item 1. Business” sections (7 firms × 3 years) from SEC 10-K filings and crafted 10 nuanced, open-ended questions per document. These weren’t trivia or summary tasks; they required interpretation, synthesis, and domain-specific reasoning.
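The paper doesn’t publish its extraction pipeline, but carving Item 1 out of a filing is straightforward in principle. Below is a minimal sketch, assuming the 10-K has already been fetched and converted to plain text; the helper name and the regex boundaries are my assumptions, not the author’s code.

```python
import re

def extract_item1(filing_text: str) -> str:
    """Slice the 'Item 1. Business' section from a plain-text 10-K.

    Hypothetical helper: matches everything between the 'Item 1'
    heading and the 'Item 1A. Risk Factors' heading. Filings also
    repeat these headings in the table of contents, so we keep the
    longest match rather than the first.
    """
    pattern = re.compile(
        r"item\s*1\s*[.\-:]?\s*business(.*?)item\s*1a\s*[.\-:]?\s*risk\s*factors",
        re.IGNORECASE | re.DOTALL,
    )
    sections = pattern.findall(filing_text)
    return max(sections, key=len).strip() if sections else ""
```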
Sample prompt:
“What are the company’s indicated strategic goals for the next two to three years? Explain why you choose your response.”
To minimize memory artifacts, all models were prompted in isolated sessions. Evaluations followed three tracks:
| Track | Description |
|---|---|
| Human Ratings | Five raters scored relevance, completeness, clarity, conciseness, and accuracy |
| Metric Benchmarks | ROUGE, cosine similarity (via SBERT), Jaccard |
| Behavioral Consistency | Prompt variance and inter-model agreement |
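The paper reports these metrics without accompanying code. Here is a minimal sketch of the three automatic benchmarks, assuming the `rouge-score` and `sentence-transformers` packages; the ROUGE-L variant and the `all-MiniLM-L6-v2` SBERT checkpoint are my assumptions, since the paper only names ROUGE, SBERT cosine similarity, and Jaccard.

```python
# pip install rouge-score sentence-transformers
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

ROUGE = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
SBERT = SentenceTransformer("all-MiniLM-L6-v2")  # checkpoint is an assumption

def score_answer(candidate: str, reference: str) -> dict:
    """Score one model answer against a reference on all three tracks."""
    # Track 1: ROUGE-L F1 (longest-common-subsequence overlap).
    rouge_l = ROUGE.score(reference, candidate)["rougeL"].fmeasure

    # Track 2: cosine similarity between SBERT sentence embeddings.
    embs = SBERT.encode([candidate, reference], convert_to_tensor=True)
    cosine = util.cos_sim(embs[0], embs[1]).item()

    # Track 3: Jaccard similarity over lowercase token sets.
    c, r = set(candidate.lower().split()), set(reference.lower().split())
    jaccard = len(c & r) / len(c | r) if c | r else 0.0

    return {"rougeL": rouge_l, "cosine": cosine, "jaccard": jaccard}
```

Note how the tracks can disagree: an answer that parrots the filing’s wording can post a high ROUGE-L while its embedding drifts from the reference, which is exactly the pattern the study reports for Gemini.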
## Findings: GPT Wins, but the Story Is Deeper
The overall winner in human evaluation was GPT-4, scoring highest across nearly all qualitative metrics. But as always, the nuance matters:
| Model | Strengths | Weaknesses |
|---|---|---|
| GPT-4 | Balanced; high relevance; strong factual grounding | Slight verbosity; less replicable lexical overlap |
| Claude 4 | Strong semantic alignment; top factual accuracy | Slightly less complete and concise than GPT-4 |
| Perplexity | Reliable across tasks; semantically strong | Mid-tier factuality; occasional vague phrasing |
| Gemini | High ROUGE; excels at lexical match and phrasing | Overly verbose; lower coherence in high-stakes contexts |
| DeepSeek | Most concise output | Weakest in relevance and semantic alignment |
Gemini scored best on ROUGE and Jaccard, but its verbose, shallow answers consistently underperformed in human judgments, underscoring that lexical overlap ≠ comprehension.
## Model Behavior: When Agreement Disappears
The most alarming finding isn’t who wins; it’s where the models disagree. While GPT and Claude often converged semantically (cosine similarity > 0.8), Gemini and DeepSeek frequently diverged, especially on newer disclosures such as Amazon’s 2024 filing.
A notable behavior: GPT served as the “semantic median”. Its outputs showed high similarity not only to the ground truth but also to the other models’ responses, suggesting it anchors common sense and relevance across LLMs.
*Figure: cosine similarity across models on Meta’s 2024 filing*
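The “semantic median” role is easy to compute once each model has answered the same question. A minimal sketch, reusing the same hypothetical SBERT checkpoint as above:

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

SBERT = SentenceTransformer("all-MiniLM-L6-v2")  # checkpoint is an assumption

def semantic_median(answers: dict[str, str]) -> str:
    """Return the model whose answer is, on average, most similar to
    every other model's answer: the 'semantic median' of the group."""
    names = list(answers)
    embs = SBERT.encode([answers[n] for n in names], convert_to_tensor=True)
    sims = util.cos_sim(embs, embs).cpu().numpy()  # pairwise cosine matrix
    np.fill_diagonal(sims, np.nan)                 # drop self-similarity
    return names[int(np.nanargmax(np.nanmean(sims, axis=1)))]
```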
## Implications for Financial Professionals
This paper doesn’t just benchmark models; it provides a litmus test for deploying FinNLP in real-world contexts.
Takeaways:
- Prompt sensitivity is real and under-discussed. Small shifts in wording yield different outputs, especially from Gemini and DeepSeek; see the measurement sketch after this list.
- Lexical similarity is a poor proxy for interpretive quality. Gemini’s high ROUGE didn’t translate into actionable insights.
- Hybrid usage has potential. GPT for base reasoning, Gemini for verbose extraction, Claude for fact-checking: this trinity could power explainable FinAI systems.
- Domain matters. GPT’s dominance on Microsoft filings versus its lower scores on Amazon’s 2024 filing hints at varying training exposure and disclosure complexity.
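One way to quantify the prompt sensitivity flagged in the first takeaway: pose several paraphrases of the same question, embed the answers, and measure how far they drift from one another. The `ask_model` callable below is hypothetical, standing in for whichever LLM API you wrap.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

SBERT = SentenceTransformer("all-MiniLM-L6-v2")  # checkpoint is an assumption

def prompt_sensitivity(ask_model, paraphrases: list[str]) -> float:
    """Mean pairwise cosine distance between answers to reworded
    prompts; higher values mean the model is more prompt-sensitive."""
    answers = [ask_model(p) for p in paraphrases]
    embs = SBERT.encode(answers, convert_to_tensor=True)
    sims = util.cos_sim(embs, embs).cpu().numpy()
    n = len(paraphrases)
    off_diag = sims[~np.eye(n, dtype=bool)]  # exclude self-similarity
    return float(1.0 - off_diag.mean())
```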
## Final Thoughts: Choose Your Analyst Wisely
In finance, misreading the narrative can cost millions. This study moves us toward a more scientific understanding of how LLMs read between the lines. For developers and firms alike, the message is clear: evaluating LLMs isn’t just about benchmark scores; it is about behavioral predictability, interpretive alignment, and strategic coherence.
GPT may currently lead the pack, but no model is infallible. What we need next is not just a better model but a better understanding of how models think, reason, and fail.
Cognaptus: Automate the Present, Incubate the Future.