How do we judge whether an AI is thinking like a human—or at least like a financial analyst? A new benchmark, ConDiFi, offers a compelling answer: test not just whether an LLM gets the right answer, but whether it can explore possible ones. That’s because true financial intelligence lies not only in converging on precise conclusions but in diverging into speculative futures.
Most benchmarks test convergent thinking: answer selection, chain-of-thought, or multi-hop reasoning. But strategic fields like finance also demand divergent thinking—creative, open-ended scenario modeling that considers fat-tail risks and policy surprises. ConDiFi (short for Convergent-Divergent for Finance) is the first serious attempt to capture both dimensions in one domain-specific benchmark.
ConDiFi at a Glance
| Aspect | Convergent Track | Divergent Track |
|---|---|---|
| Task Type | Multi-step MCQ with distractors | Branching timeline generation |
| Input Context | Post-2025 NYSE company news | Post-2025 macro/political/economic scenarios |
| Evaluation Metric | Correctness (CCS) | 5-point rubric + graph-based Richness score |
| Examples | FX pass-through, legal traps, policy games | Investment foresight under real-world uncertainty |
This design forces models to both synthesize and speculate—a dual cognitive challenge that human analysts face every day.
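To make the dual-track format concrete, here is a minimal Python sketch of how the two item types could be represented. The class and field names are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ConvergentItem:
    """Hypothetical schema for an adversarial multi-step MCQ."""
    context: str          # post-2025 NYSE company news
    options: list[str]    # one logically sound timeline plus crafted distractors
    answer_index: int     # position of the sound option

@dataclass
class DivergentItem:
    """Hypothetical schema for an open-ended scenario prompt."""
    scenario: str                     # post-2025 macro/political/economic setup
    rubric_axes: tuple[str, ...] = (  # scored on a 5-point rubric, plus Richness
        "plausibility", "novelty", "elaboration", "actionability",
    )
```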
Divergent Thinking: Where Creativity Meets Utility
The divergent tasks require LLMs to produce timeline trees of hypothetical future developments. These are scored across five axes: plausibility, novelty, elaboration, actionability, and a graph-based richness score.
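The paper's exact Richness formula isn't reproduced here; as a back-of-the-envelope illustration, the structural side of the score can be approximated from the timeline tree itself, for example via average branching factor and the number of distinct root-to-leaf paths.

```python
from dataclasses import dataclass, field

@dataclass
class TimelineNode:
    """One hypothetical future development; children are follow-on branches."""
    event: str
    children: list["TimelineNode"] = field(default_factory=list)

def richness(root: TimelineNode) -> dict[str, float]:
    """Toy structural-richness summary: average branching factor and leaf-path count.
    Illustrative only; ConDiFi's actual Richness metric may differ."""
    internal_nodes = 0
    child_count = 0
    leaf_paths = 0

    def walk(node: TimelineNode) -> None:
        nonlocal internal_nodes, child_count, leaf_paths
        if not node.children:            # a leaf terminates one root-to-leaf path
            leaf_paths += 1
            return
        internal_nodes += 1
        child_count += len(node.children)
        for child in node.children:
            walk(child)

    walk(root)
    avg_branching = child_count / internal_nodes if internal_nodes else 0.0
    return {"avg_branching_factor": avg_branching, "leaf_paths": float(leaf_paths)}
```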
Interestingly, fluency didn’t equate to originality. GPT-4o scored high in plausibility but fell behind in novelty and actionable insight. In contrast, DeepSeek-R1 and Cohere Command-A consistently produced creative and tradable forecasts.
Example: Asked to model future trade policy shifts in Asia, weaker models regurgitated clichés (“regional tensions rise”). Stronger ones proposed timeline branches like sovereign wealth funds accelerating chip realignment—a creative but plausible thesis.
One of ConDiFi’s key insights is that many LLMs elaborate fluently but don’t speculate deeply. A model may offer detailed, sector-specific commentary, but if that commentary only traces obvious paths, it is of little use in volatile markets. In finance, the non-obvious and useful is the gold standard.
Convergent Thinking: Precision Under Pressure
The convergent test pushes models to choose the one logically sound timeline among carefully crafted distractors. Each incorrect option subtly violates factor alignment, temporal coherence, or causal entailment. These adversarial MCQs are crafted using six clever pipelines—like introducing policy counterfactuals or masking key regulatory constraints.
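Below is a minimal sketch of how convergent correctness could be tallied, assuming CCS reduces to plain accuracy over the adversarial items (the paper may weight or aggregate differently); the flaw taxonomy simply mirrors the three failure modes named above.

```python
from enum import Enum

class DistractorFlaw(Enum):
    """Ways a distractor can subtly fail, per the failure modes described above."""
    FACTOR_MISALIGNMENT = "factor alignment"
    TEMPORAL_INCOHERENCE = "temporal coherence"
    BROKEN_CAUSAL_ENTAILMENT = "causal entailment"

def convergent_score(predictions: list[int], answers: list[int]) -> float:
    """Share of adversarial MCQs answered correctly.
    Assumes CCS is plain accuracy; the actual weighting may differ."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)
```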
Models like LLaMA 4-Maverick and OpenAI’s o1 led the pack here, while DeepSeek-R1 also ranked in the top five—showing that divergent and convergent talents can coexist, but don’t always align.
What’s striking is that some models performed poorly not due to logic errors, but because they prioritized optimistic outcomes, ignoring negative signals embedded in the question. Others misunderstood the evaluation criteria entirely—mistaking realism for entailment, or neglecting procedural steps in regulatory logic.
What ConDiFi Teaches Us
- Finance demands dual intelligence. Convergent and divergent reasoning are not substitutes. ConDiFi shows that some models are strong in one but not the other. Ideal AI agents will need both.
- Prompting and training matter more than size. DeepSeek-R1 and Cohere Command-A outperformed GPT-4o despite being smaller. Their fine-tuning prioritized strategic creativity and tactical utility, traits often lost in generalist models.
- Evaluating creativity needs structure. ConDiFi’s use of graph metrics (like branching factor and leaf path count) gives substance to claims of creativity. It avoids rewarding verbosity or generic filler.
- Ensemble potential is real. The authors’ inter-model correlation analysis shows which models complement each other. For instance, DeepSeek-R1’s speculative style differs from the disciplined precision of LLaMA, making them strong ensemble candidates; a rough sketch of this correlation idea follows the list.
- We need cognitive alignment, not just output alignment. Too many models generate responses that sound right but lack strategic insight. By scoring not just what they say but how they think, ConDiFi pushes evaluation closer to the real demands of financial decision-making.
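To illustrate the complementarity point (this is not the authors' exact analysis), per-item scores from two models can be correlated; a low or negative correlation suggests the models stumble on different items and may ensemble well.

```python
import numpy as np

def complementarity(scores_a: list[float], scores_b: list[float]) -> float:
    """Pearson correlation of two models' per-item scores.
    Low or negative values hint the models err on different items,
    which is the intuition behind pairing them in an ensemble."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.size != b.size or a.size < 2:
        raise ValueError("need equal-length score lists with at least two items")
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical usage with made-up per-item rubric scores:
# print(complementarity([4.2, 3.8, 4.5, 2.9], [3.1, 4.4, 3.0, 4.1]))
```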
Final Thought: Benchmarks That Think Ahead
ConDiFi marks a shift from knowledge tests to cognitive diagnostics. As AI systems increasingly support human judgment in finance, national security, and enterprise strategy, they must reason with nuance, uncertainty, and imagination. The old question—did the model answer correctly?—is no longer enough.
Instead, the new question is:
Did the model think like someone whose judgment you would trust with real money?
That’s the standard ConDiFi begins to enforce.
Cognaptus: Automate the Present, Incubate the Future