TL;DR for operators

An AI financial assistant may sound balanced, prudent, and numerate. That is not the same thing as being suitable.

The paper behind this article tests leading LLMs on 14 financial decision questions and compares their answers with human responses from a cross-national dataset covering 53 nations [1]. The models mostly behave like expected-value machines on lottery-style questions. Give them a risky payoff with clear probabilities, and they often land near the mathematically neutral answer. Very tidy. Very spreadsheet. Very unlike the way many actual clients think when money is uncertain and losses feel personal.

The more interesting result appears when the questions move from lottery arithmetic to time preference. Several models produce implied impatience parameters that violate the standard economic discounting framework used by the paper. On the present-bias parameter, Gemini produces a median $\beta$ above 1, which implies an unusual overweighting of the future under that framework. That does not mean Gemini has become a saintly retirement planner. It means the apparent rationality of the model can become internally strange once the task requires a coherent preference system across time.

The clustering result is also useful, but should not be over-romanticised. The LLM response profiles form a distinct group, separate from most national human profiles. Tanzania is the closest observed national neighbour in the analysis. That is an empirical proximity result, not proof that the models “think like Tanzanians”, and certainly not proof of a specific training-data origin. Correlation is still not causation, even when the dendrogram looks confident in a lab coat.

For operators building AI wealth, robo-advice, planning, underwriting, or investor education products, the practical lesson is simple: do not confuse fluent financial reasoning with client preference alignment. The model’s default answer may be mathematically neat but behaviourally mismatched. A production system needs explicit preference elicitation, suitability checks, cultural and jurisdictional calibration, scenario testing, and escalation paths. Otherwise, the product may give advice that is rational for Homo Silicus and awkward for everyone else.

The financial assistant that thinks like a calculator, until it does not

Imagine a client asks an AI adviser whether to take a guaranteed payment now or wait for a larger one later. Then the same client asks whether to accept a lottery, insure against a loss, or demand a higher payoff before taking a risk. The model answers cleanly. It explains the probabilities. It avoids panic. It says something that sounds like a CFA candidate after too much coffee.

The temptation is to say: excellent, the machine is rational.

The paper makes that conclusion harder to sustain. Its central comparison is not “AI versus no AI”. It is LLM financial behaviour versus human financial behaviour across countries. That comparison matters because financial advice is not only about calculating expected value. It is about matching decisions to real preferences: patience, fear of loss, ambiguity tolerance, household constraints, liquidity pressure, cultural context, and the strange human ability to know the right answer and still hate it.

The authors test seven LLM profiles from the GPT family, Gemini, and DeepSeek using financial decision questions adapted from Wang et al.’s INTRA dataset. The questions cover time preference, ambiguity aversion, risk preference, and loss aversion. Each model is queried through APIs in 100 stateless trials, with all 14 questions presented in the same sequence. The authors then compare median LLM responses with median country-level human responses.
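The trial design described above can be sketched as a small harness. This is a minimal illustration, not the authors' code: `run_stateless_trials`, `make_stub_model`, and the question labels are hypothetical names, and the stub stands in for a real API client. It reports a spread alongside the median, because a median alone can hide how much the answers move between trials.

```python
import random
import statistics
from typing import Callable, Dict, List


def run_stateless_trials(
    ask_model: Callable[[List[str]], List[float]],
    questions: List[str],
    n_trials: int = 100,
) -> Dict[str, Dict[str, float]]:
    """Run independent trials and summarise the numeric answers per question."""
    if n_trials < 2:
        raise ValueError("need at least two trials to report a spread")
    per_question: List[List[float]] = [[] for _ in questions]
    for _ in range(n_trials):
        # Each call gets a fresh copy of the questions and nothing else,
        # so no chat history carries over between trials.
        answers = ask_model(list(questions))
        if len(answers) != len(questions):
            raise ValueError("expected one numeric answer per question")
        for bucket, value in zip(per_question, answers):
            bucket.append(float(value))

    summary: Dict[str, Dict[str, float]] = {}
    for question, values in zip(questions, per_question):
        q1, _, q3 = statistics.quantiles(values, n=4)
        summary[question] = {
            "median": statistics.median(values),
            "iqr": q3 - q1,  # interquartile range: spread the median hides
            "min": min(values),
            "max": max(values),
        }
    return summary


def make_stub_model(seed: int = 0) -> Callable[[List[str]], List[float]]:
    """Hypothetical stand-in for an API client; replace with a real call."""
    rng = random.Random(seed)

    def stub(questions: List[str]) -> List[float]:
        # First answer is always 50.0; the rest are noisy around 100.
        return [50.0] + [100.0 + rng.gauss(0.0, 10.0) for _ in questions[1:]]

    return stub


if __name__ == "__main__":
    report = run_stateless_trials(make_stub_model(), ["lottery", "delay"])
    for name, stats in report.items():
        print(name, stats)
```

A question with an interquartile range of zero is a rigid default. A question with a wide range is one where the median is doing a lot of quiet averaging.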

This design is not trying to see whether a model can define “risk aversion”. It is trying to map the model’s revealed financial preferences. That is more operationally relevant. A chatbot that can explain diversification is easy. A chatbot whose default risk preference silently differs from the client’s is the expensive part.

The first contrast: humans price risk emotionally; LLMs often price it arithmetically

The clearest finding is in the lottery questions. On many probability-based items, the LLM medians sit close to expected value.

That sounds mundane until you remember what the human benchmark is doing. People do not usually price lotteries as if they were frictionless expected-value engines. They overweight losses, dislike ambiguity, demand compensation for uncertainty, react to scale, and often treat a \$100 loss as more psychologically meaningful than a \$100 gain. This is inconvenient for elegant models and highly relevant for actual wealth management. Humans, regrettably, insist on being humans.

The paper’s descriptive table shows how sharply the model responses differ from the human country medians on several lottery items. For example, questions where the expected-value answer is straightforward often produce identical or near-identical medians across the LLMs. Some items show zero variance across the seven LLM profiles. The authors interpret this as a tendency toward deterministic or expected-value-aligned response strategies.

The business interpretation is not that expected value is bad. It is that expected value is only one layer of financial suitability. In an investment product, a model that defaults to risk-neutral valuation may be useful for pricing, simulation, or education. It is less safe as a direct proxy for client preference.
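The gap between a risk-neutral valuation and a loss-sensitive one can be made concrete with a few lines of arithmetic. The sketch below is an illustration only: the function names, the piecewise-linear loss weighting, and the `loss_weight` value of 2.0 are assumptions chosen for the example, not parameters taken from the paper.

```python
from typing import List, Tuple

# A lottery is a list of (probability, payoff) pairs.
Lottery = List[Tuple[float, float]]


def expected_value(lottery: Lottery) -> float:
    """Risk-neutral valuation: probability-weighted average payoff."""
    total_p = sum(p for p, _ in lottery)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * x for p, x in lottery)


def loss_averse_value(lottery: Lottery, loss_weight: float) -> float:
    """Same average, but losses are scaled by loss_weight (> 1 means loss-averse)."""
    expected_value(lottery)  # reuse the probability check
    return sum(p * (x if x >= 0 else loss_weight * x) for p, x in lottery)


def risk_attitude(stated_price: float, lottery: Lottery, tol: float = 1e-9) -> str:
    """Classify a stated price for a lottery relative to its expected value."""
    ev = expected_value(lottery)
    if abs(stated_price - ev) <= tol:
        return "risk-neutral"
    return "risk-averse" if stated_price < ev else "risk-seeking"


if __name__ == "__main__":
    coin_flip = [(0.5, 100.0), (0.5, -100.0)]
    print(expected_value(coin_flip))          # the spreadsheet's view
    print(loss_averse_value(coin_flip, 2.0))  # a loss-averse client's view
```

With these made-up numbers, the same coin flip is worth zero to the expected-value engine and is clearly unattractive to a client who weights losses twice as heavily as gains. Neither valuation is wrong. They belong to different decision-makers.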

| Paper observation | What it directly supports | Business meaning | Boundary |
| --- | --- | --- | --- |
| LLMs often choose lottery amounts close to expected value | Models display risk-neutral patterns on probability-framed financial items | Useful for analytical consistency, but not enough for suitability | The paper uses hypothetical survey questions, not live client decisions |
| Several questions show no variance across LLM medians | Some model responses converge strongly across repeated trials | Default model behaviour may be more rigid than user-facing tone suggests | Median aggregation may hide distributional variation across the 100 trials |
| Human country medians vary more widely | Financial preferences differ across populations | Client segmentation matters; one “rational” default is not universal | Country medians are coarse and should not be treated as individual profiles |

Expected-value behaviour can be operationally attractive. It is stable, auditable, and easy to explain. The problem is that financial advice is not judged only by whether the spreadsheet smiled. It is judged by whether the recommendation fits the person who must live with the outcome.

The second contrast: simple lotteries flatter the models; time preference exposes the wobble

The paper becomes more interesting when it moves from lottery pricing to intertemporal choice.

The authors estimate two behavioural parameters from the time-preference questions: present bias, $\beta$, and long-term impatience, $\delta$. Under the quasi-hyperbolic framework used in the paper, the respondent states the amount $X$ in one year, and the amount $Y$ in ten years, that would make waiting as attractive as \$100 now. From those indifference values, the authors derive:

$$ \delta = \left(\frac{X}{Y}\right)^{1/9} $$

and then recover:

$$ \beta = \frac{100}{\delta X} $$
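
For readers who want the intermediate step, here is a sketch of where these two expressions come from. It assumes the usual quasi-hyperbolic indifference conditions with utility linear in money; the paper's own derivation is not reproduced here, so treat this as a consistency check rather than a quotation.

$$
\begin{aligned}
100 &= \beta\,\delta\,X && \text{(indifferent between 100 now and } X \text{ in one year)} \\
100 &= \beta\,\delta^{10}\,Y && \text{(indifferent between 100 now and } Y \text{ in ten years)} \\
\delta X &= \delta^{10} Y \;\Rightarrow\; \delta^{9} = \frac{X}{Y} \;\Rightarrow\; \delta = \left(\frac{X}{Y}\right)^{1/9} \\
\beta &= \frac{100}{\delta X}
\end{aligned}
$$

One consequence is worth stating plainly: $\delta > 1$ exactly when $X > Y$, that is, when the respondent asks for more money to wait one year than to wait ten.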

In this framework, $\delta$ should fall within $0 < \delta \leq 1$. A value above 1 implies that future utility is being valued more than present utility in a way that violates the standard discounting logic. Similarly, $\beta$ is usually interpreted within $0 < \beta \leq 1$, where lower values indicate stronger present bias.

Here the models look less like calm financial sages. The paper reports that GPT o3-mini, GPT 4.0, GPT 4.1, and DeepSeek produce $\delta > 1$. Gemini’s median $\beta$ is reported as 1.13. In other words, several models produce time-preference responses that the paper treats as economically incoherent under the chosen normative framework.

This is not a minor technical footnote. It is the difference between answering a lottery and maintaining a consistent financial preference system.

A model can calculate the expected value of a one-shot gamble and still fail to express coherent preferences across horizons. That matters because real financial advice is mostly horizon management: saving, retirement, liquidity, debt repayment, insurance, education planning, property purchases, business cash flow, and portfolio drawdown. Time is not a decorative axis in finance. It is the room where the bodies are buried.

The right operator takeaway is not “LLMs cannot reason”. That is too broad and too theatrical. The better interpretation is narrower and more useful: a model’s local numerical answer may look disciplined while its implied preference structure is unstable. Testing the answer is therefore not enough. You need to test the preference model implied by a sequence of answers.
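A minimal version of that test, using the two formulas given earlier, might look like the following. The function names and the example amounts are hypothetical; only the expressions for $\delta$ and $\beta$ and the expected ranges come from the framework described above.

```python
from typing import Dict


def quasi_hyperbolic_params(
    x_one_year: float, y_ten_years: float, now: float = 100.0
) -> Dict[str, float]:
    """Recover beta and delta from two stated indifference amounts.

    x_one_year:  amount in one year judged as attractive as `now` today
    y_ten_years: amount in ten years judged as attractive as `now` today
    """
    if min(x_one_year, y_ten_years, now) <= 0:
        raise ValueError("amounts must be positive")
    delta = (x_one_year / y_ten_years) ** (1.0 / 9.0)
    beta = now / (delta * x_one_year)
    return {"beta": beta, "delta": delta}


def coherence_flags(params: Dict[str, float], tol: float = 1e-9) -> Dict[str, bool]:
    """True where a parameter falls in the conventional (0, 1] range."""
    return {name: 0.0 < value <= 1.0 + tol for name, value in params.items()}


if __name__ == "__main__":
    # Made-up answers: asks for more after ten years than after one.
    patient = quasi_hyperbolic_params(150.0, 300.0)
    print(patient, coherence_flags(patient))

    # Made-up answers: asks for MORE after one year than after ten.
    odd = quasi_hyperbolic_params(200.0, 150.0)
    print(odd, coherence_flags(odd))
```

Each answer in the second case looks like a sensible number on its own. Only the pair reveals a $\delta$ above 1, which is the point: the incoherence lives in the sequence, not in any single reply.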

The third contrast: the models do not become an average human; they form their own cluster

The paper also asks a more culturally loaded question: if LLMs resemble humans, which humans do they resemble?

To answer this, the authors use hierarchical clustering and PCA on the 14-dimensional response profiles. They test several linkage criteria and report that correlation-based average linkage gives the strongest silhouette score: 0.917, compared with roughly 0.552 for the other tested linkage methods. This is primarily a model-selection and robustness step. It supports the choice of the displayed dendrogram; it does not by itself explain why the cluster exists.
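To show what a silhouette score measures, here is a small pure-Python sketch using a correlation-based distance. It does not perform the hierarchical clustering itself and is not the authors' pipeline; in practice a library implementation would be the sensible choice. The four toy profiles are invented for the example.

```python
import math
from typing import Dict, List, Sequence


def correlation_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 minus the Pearson correlation: 0 for same shape, 2 for opposite shape."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0 or sv == 0:
        raise ValueError("a constant profile has no defined correlation")
    return 1.0 - cov / (su * sv)


def silhouette_score(dist: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette over all points, given a full distance matrix."""
    clusters: Dict[int, List[int]] = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # a singleton cluster contributes a silhouette of 0
        a = sum(dist[i][j] for j in own) / len(own)  # cohesion
        b = min(  # separation: nearest other cluster, by mean distance
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(labels)


if __name__ == "__main__":
    # Toy profiles: two rising, two falling (made up for illustration).
    profiles = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [8, 6, 4, 2]]
    dist = [[correlation_distance(p, q) for q in profiles] for p in profiles]
    print(silhouette_score(dist, [0, 0, 1, 1]))  # groups match the shapes
    print(silhouette_score(dist, [0, 1, 0, 1]))  # groups cut across the shapes
```

A score near 1 means points sit close to their own group and far from the others. A negative score means the grouping is fighting the data. Note that the score says how well separated the groups are, not why they exist.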

The clustering result is the headline-grabber. The LLM profiles form a distinct group separate from most national respondent profiles. In the PCA view, the first three principal components explain 79.2% of the variance: PC1 accounts for 48.4%, PC2 for 22.4%, and PC3 for 8.4%. The LLM group appears away from the dense cluster of most countries. Tanzania is the national profile closest to the model cluster.

This is where a sloppy reading would sprint directly into cultural storytelling. The paper itself offers a possible explanation involving human-feedback labour and East African annotators. That may be worth investigating, but it remains speculative. The authors correctly acknowledge that clustering proximity cannot causally identify training processes without access to training data.

The more robust business reading is simpler: the LLMs are not a neutral average of global financial preferences. They occupy a distinctive behavioural position in this benchmark.

That matters because many AI finance products implicitly sell the model as a general-purpose reasoning layer. But if the default behavioural profile is not representative of the client base, then the product is not merely “using AI”. It is importing a synthetic preference prior. Quietly. Politely. With bullet points.

The evidence map: what each analysis is actually doing

The paper uses several empirical components. They are not all doing the same job.

| Analysis | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Descriptive statistics across 14 items | Main evidence | LLM medians differ from country medians and often align with expected-value answers | That models would behave the same in personalised advisory conversations |
| Repeated stateless API trials | Implementation design | Responses are not driven by chat history or previous trial memory | Full reproducibility across temperatures, prompts, model versions, or providers |
| Hierarchical clustering | Main evidence | LLM response profiles form a distinct group under the selected distance/linkage method | Cognitive equivalence between LLMs and any human population |
| Silhouette comparison across linkage methods | Robustness/model-selection check | Correlation-based average linkage is better supported for the displayed clustering | That the resulting clusters have causal interpretation |
| PCA and K-means with three clusters | Main evidence and interpretation aid | The LLM cluster separates visually and statistically in reduced dimensions | That three clusters are the only meaningful behavioural taxonomy |
| Figure 3 question contributions | Explanatory diagnostic | Risk items, time-preference items, and ambiguity/loss-related items load differently on components | A standalone behavioural theory of model finance |
| Present-bias and impatience estimation | Main evidence for intertemporal coherence | Several LLMs imply abnormal $\delta$ or $\beta$ values under the paper’s framework | A complete diagnosis of model reasoning ability |

This distinction is important because business readers often ask the wrong question: “Is the model good or bad?”

The paper is not that kind of scoreboard. It is a behavioural audit. It says the model’s financial persona can be measured, and that persona is not automatically human, global, or internally consistent.

The product risk is suitability drift

For AI finance operators, the most practical concept here is suitability drift.

Suitability drift happens when the system’s default reasoning style gradually substitutes for the client’s actual financial preferences. The user thinks they are receiving personalised guidance. The product may even ask a few onboarding questions. But underneath, the model’s own behavioural prior still shapes the recommendation: how it frames risk, how it treats delay, how it prices uncertainty, and how it balances loss against gain.

This can happen without hallucination. That is the unpleasant part. A recommendation can be factually correct, mathematically coherent in a narrow sense, and still unsuitable.

Consider three common use cases.

First, robo-advice. A model that naturally leans toward expected-value logic may underweight the client’s emotional intolerance for drawdowns. In a backtest, this looks efficient. In a real account, it becomes panic selling.

Second, retirement and savings planning. A model with unstable intertemporal preferences may give locally plausible answers to individual questions while failing to maintain consistent horizon logic across planning scenarios. The client does not experience retirement as a sequence of isolated prompts. Sadly, life has state.

Third, investor education. If the model teaches users that rationality means expected-value calculation, it may improve numeracy while flattening behavioural reality. Good education should help users understand their own risk preferences, not shame them into pretending they are a Monte Carlo engine.

The operational response is not to ban LLMs from finance. That would be satisfyingly dramatic and commercially useless. The response is to put preference architecture around them.

A serious AI finance system should include:

| Control layer | Practical function | Why this paper makes it more important |
| --- | --- | --- |
| Preference elicitation | Capture risk tolerance, loss aversion, liquidity needs, and time horizon explicitly | Model defaults are not reliable proxies for client preferences |
| Consistency checks | Test whether recommendations remain coherent across equivalent scenarios | Intertemporal answers may imply abnormal $\beta$ or $\delta$ values |
| Client segmentation | Calibrate outputs by jurisdiction, culture, financial literacy, and product type | Country-level human responses vary, while LLMs occupy a distinct cluster |
| Suitability review | Require human or rule-based review for high-impact recommendations | Fluent answers can still be behaviourally mismatched |
| Prompt and temperature testing | Evaluate outputs across prompt frames and sampling settings | The paper uses one temperature and one prompt sequence |
| Audit logs | Store decision paths, preference inputs, and recommendation rationales | Behavioural drift must be detectable after the fact |

The core product question changes from “Can the model answer financial questions?” to “Whose financial preferences does the model enact when the client has not fully specified their own?”

That question is much harder. Naturally, it is also the one that matters.

What the paper shows, what Cognaptus infers, and what remains uncertain

The paper directly shows that, under its experimental setup, LLMs produce financial decision profiles that differ from most national human profiles. It also shows that these models often align with expected-value reasoning in lottery questions and sometimes violate standard discounting assumptions in time-preference estimation.

Cognaptus infers that AI finance products should treat LLMs as behavioural actors, not just language interfaces. Their defaults can shape advice. Their financial persona should be tested before deployment, monitored after deployment, and constrained where client suitability matters.

What remains uncertain is the origin of the observed profile. The Tanzania proximity is intriguing, but descriptive. It could reflect survey structure, language effects, prompt wording, API behaviour, model training, post-training feedback, aggregation choices, or the geometry of the selected clustering method. The paper does not have access to proprietary training data, so it cannot resolve that causal story.

There is also a measurement boundary. The study uses median responses across 100 stateless API trials, at a fixed temperature of 0.7, with all questions presented in sequence. Different prompts, personas, temperatures, model versions, or advisory contexts could change the results. In fact, that is part of the point: if small context changes alter revealed preferences, then preference governance becomes a production requirement, not a research luxury.

The appendix is not decoration; it tells operators what to test next

The appendix lists the 14 survey questions. This is more useful than it first appears.

The questions are not exotic. They ask about choosing between payments now and later, pricing lotteries, avoiding losses, and accepting risky gains. These are exactly the micro-decisions embedded inside financial advice systems. A bank, broker, insurer, or wealth platform could adapt similar probes into its own model-evaluation suite.

The valuable move is to test sequences, not isolated answers. Ask the model equivalent questions in different forms. Ask for advice under different client profiles. Ask it to recommend, then infer the implied preference parameters. Compare those implied parameters against the client profile, product constraints, and jurisdictional suitability rules.
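The last step, comparing implied parameters with a client profile, can be as simple as a tolerance check. The sketch below is one possible shape for it, under loud assumptions: `PreferenceProfile`, its three fields, the tolerances, and every number in the example are hypothetical and would need to be set by whoever owns suitability policy.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass(frozen=True)
class PreferenceProfile:
    """Hypothetical summary of revealed or elicited preferences."""
    beta: float         # present bias (1.0 means none)
    delta: float        # long-term patience (1.0 means no discounting)
    loss_weight: float  # how much heavier a loss feels than an equal gain


def preference_gaps(implied: PreferenceProfile, client: PreferenceProfile) -> Dict[str, float]:
    """Absolute gap between the model's implied profile and the client's."""
    return {
        f.name: abs(getattr(implied, f.name) - getattr(client, f.name))
        for f in fields(PreferenceProfile)
    }


def flag_mismatches(
    implied: PreferenceProfile,
    client: PreferenceProfile,
    tolerances: Dict[str, float],
) -> List[str]:
    """Names of parameters whose gap exceeds the allowed tolerance."""
    gaps = preference_gaps(implied, client)
    return [name for name, gap in gaps.items() if gap > tolerances[name]]


if __name__ == "__main__":
    # All values below are made up for illustration.
    implied_by_model = PreferenceProfile(beta=1.10, delta=1.02, loss_weight=1.0)
    elicited_from_client = PreferenceProfile(beta=0.90, delta=0.95, loss_weight=2.0)
    tolerances = {"beta": 0.1, "delta": 0.1, "loss_weight": 0.5}
    print(flag_mismatches(implied_by_model, elicited_from_client, tolerances))
```

A non-empty list is a reason to escalate or re-prompt with explicit client preferences, not a verdict on the model. The useful part is that the check runs on the preference structure, so a perfectly fluent answer can still fail it.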

This is where the paper becomes operational. It suggests a route from prompt testing to behavioural QA.

Not:

“Does the model produce a reasonable answer?”

But:

“Does the model produce a stable and suitable preference structure across financially equivalent situations?”

That second question is more annoying. It is also the difference between a demo and a defensible product.

Homo Silicus needs a suitability file

The phrase “Homo Silicus” is useful because it avoids a lazy binary. LLMs are not merely irrational machines, nor are they clean rational agents. They are trained artefacts that can imitate reasoning, absorb human patterns, calculate neatly, and still produce preference profiles that do not map cleanly onto real clients.

In finance, that hybridity is the risk.

The paper’s contribution is not that LLMs are bad at money. It is more subtle: they have a financial personality. It can be benchmarked. It may look rational in one task and incoherent in another. It may cluster away from most human populations. And it may enter products as an unexamined prior unless operators deliberately constrain it.

That creates a practical rule for AI finance: never deploy a model’s financial advice layer without testing its revealed preferences.

Not its vocabulary. Not its confidence. Not its ability to define Sharpe ratio while sounding like a LinkedIn post that discovered decimals. Its revealed preferences.

Because when Homo Silicus goes to Wall Street, the question is not whether it can talk about money.

The question is whether it understands whose money it is.

Cognaptus: Automate the Present, Incubate the Future.


1. Orhan Erdem and Ragavi Pobbathi Ashok, “Artificial Finance: How AI Thinks About Money,” arXiv:2507.10933, 2025, https://arxiv.org/pdf/2507.10933