When Your AI Disagrees with Your Portfolio

TL;DR for operators

An AI investment assistant does not enter every portfolio discussion as a blank analyst. The paper behind this article shows that large language models can carry latent investment preferences: for certain sectors, for larger companies, and for contrarian rather than momentum arguments.¹

The important mechanism is simple and uncomfortable. When buy and sell evidence are balanced, the model’s internal prior can break the tie. When counter-evidence later becomes stronger, that prior does not necessarily disappear. In mixed-evidence settings, the model may latch onto the fragment of evidence that supports its original inclination and discount the stronger opposing side. Splendid. Your “neutral” analyst has discovered confirmation bias and brought it to the investment committee.

For financial AI builders, this changes the audit question. It is not enough to ask whether the model can summarise filings, classify sentiment, or produce plausible recommendations. The sharper question is whether the model preserves the user’s mandate when evidence conflicts with its own learned preferences. A model that repeatedly tilts toward technology, megacaps, or contrarian reasoning may quietly reshape a strategy without ever announcing that it has done so.

The paper’s results should be treated as a bias diagnostic, not a trading backtest. The evidence is synthetic, the stock universe is composed of prominent S&P 500 names, and the price-impact claims are deliberately simplified. Still, the operational lesson is clear: model selection in investment AI is not only a performance decision. It is a governance decision about whose financial instincts are allowed into the workflow.

The real failure mode is not hallucination, but preference leakage

Most discussions of LLM risk in finance start with hallucination. That is understandable. A fabricated earnings number is easy to fear, easy to demonstrate, and satisfyingly embarrassing in a boardroom.

This paper is about a quieter risk. The model may not invent the facts. It may read the evidence, accept the evidence, and still resolve ambiguity in a direction that reflects its own internal view of the market. That is harder to catch because the output can look thoughtful. It may include caveats, ratios, risks, and a tidy conclusion. The problem is not that the model knows nothing. The problem is that it may “know” too much of the wrong kind.

The authors frame this as knowledge conflict: a clash between external context and the model’s parametric knowledge. In investment analysis, conflict is not an edge case. It is Tuesday. A company can show improving margins and weakening guidance. A sector can have strong long-term demand and short-term valuation pressure. A portfolio manager can explicitly want energy exposure while the model has learned to admire technology names because they dominate financial discourse. The test is not whether the model can process one clean signal. The test is what happens when signals disagree.

The paper’s mechanism-first contribution is valuable because it does not merely ask, “Which stocks does the model like?” It asks how preference behaves under pressure. First, expose the prior. Then challenge it. Then observe whether the model updates or digs in.

The three-stage test turns ambiguity into a bias detector

The study uses a controlled three-stage experimental framework.

First, the authors generate synthetic buy and sell evidence for each stock. They use Gemini-2.5-Pro as the evidence generator, separate from the evaluated models, and constrain each piece of evidence to express a fixed expected price impact. The evidence is deliberately fictional but plausible. That matters: the experiment is not trying to predict whether Microsoft or Exxon will outperform next week. It is trying to create a clean conflict chamber where evidence can be balanced or imbalanced without the mess of real-time market data.

Second, the evaluated model receives balanced evidence. Buy and sell arguments are presented in equal proportion and with equal intensity. In principle, a neutral model should be uncertain or at least evenly split across repeated trials. In practice, the choice under equilibrium becomes revealing. If the external evidence is symmetrical, the model’s decision has to be coming from somewhere else. The paper treats that “somewhere else” as latent bias.

The authors quantify this with a bias score, ranging from -1 to 1, based on repeated buy and sell decisions. A score near 1 means the model tends to choose buy; a score near -1 means it tends to choose sell. The experiment is repeated across a curated set of 427 stocks that remained in the S&P 500 over the previous five years, with repeated trials and randomized evidence order to reduce positional artefacts.
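A minimal sketch of how such a score could be computed. It assumes the score is simply (buys minus sells) divided by total decisions, which matches the stated range of -1 to 1 but is not confirmed as the paper's exact formula. `ask_model` is a hypothetical callable standing in for whatever model client you use, and the evidence strings are placeholders.

```python
import random
from typing import Callable, List, Sequence


def bias_score(decisions: Sequence[str]) -> float:
    """Assumed definition: (buys - sells) / (buys + sells), in [-1, 1]."""
    buys = sum(d == "buy" for d in decisions)
    sells = sum(d == "sell" for d in decisions)
    total = buys + sells
    if total == 0:
        raise ValueError("no buy/sell decisions to score")
    return (buys - sells) / total


def run_balanced_trials(
    ask_model: Callable[[List[str]], str],  # hypothetical model wrapper
    buy_evidence: Sequence[str],
    sell_evidence: Sequence[str],
    n_trials: int = 10,
    seed: int = 0,
) -> List[str]:
    """Ask the model repeatedly, shuffling evidence order on every trial."""
    rng = random.Random(seed)
    decisions = []
    for _ in range(n_trials):
        evidence = list(buy_evidence) + list(sell_evidence)
        rng.shuffle(evidence)  # reduces positional artefacts
        decisions.append(ask_model(evidence))
    return decisions


if __name__ == "__main__":
    # Stub model that always buys, used only to exercise the functions.
    always_buy = lambda evidence: "buy"
    trials = run_balanced_trials(always_buy, ["b1", "b2"], ["s1", "s2"])
    print(bias_score(trials))
```

A score computed this way is only as meaningful as the balance of the evidence fed in, so the evidence sets need to be matched in count and intensity before the number says anything about the model.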

Third, the authors test persistence. Once a model shows a bias toward a group, the prompt is made imbalanced against that bias. The counter-evidence is strengthened either by volume or by intensity. If the model updates, the decision should flip. If it does not, the bias has become operationally consequential.
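The two ways of strengthening counter-evidence can be sketched as follows. The `Evidence` class, its field names, and the default 5% and 10% impact figures are illustrative assumptions, not the paper's data structures.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Evidence:
    side: str          # "buy" or "sell"
    impact_pct: float  # stated expected price impact, in percent (assumed field)


def opposite(side: str) -> str:
    if side not in ("buy", "sell"):
        raise ValueError(f"unknown side: {side!r}")
    return "sell" if side == "buy" else "buy"


def volume_imbalance(
    prior_side: str, n_support: int, n_counter: int, impact_pct: float = 5.0
) -> List[Evidence]:
    """More counter-evidence items than supporting items, equal intensity."""
    if n_counter <= n_support:
        raise ValueError("counter-evidence must outnumber supporting evidence")
    counter_side = opposite(prior_side)
    return ([Evidence(prior_side, impact_pct)] * n_support
            + [Evidence(counter_side, impact_pct)] * n_counter)


def intensity_imbalance(
    prior_side: str, n_each: int, support_pct: float = 5.0, counter_pct: float = 10.0
) -> List[Evidence]:
    """Equal item counts, but counter-evidence claims a larger price impact."""
    if counter_pct <= support_pct:
        raise ValueError("counter-evidence must be stronger than support")
    counter_side = opposite(prior_side)
    return ([Evidence(prior_side, support_pct)] * n_each
            + [Evidence(counter_side, counter_pct)] * n_each)
```

For a model that leans buy on a stock, `volume_imbalance("buy", 1, 3)` yields one supporting item against three opposing ones, while `intensity_imbalance("buy", 2)` keeps the counts even and doubles the stated impact of the sell side.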

The design is not a live portfolio simulator. It is closer to a stress test for mandate obedience.

Test component	Likely purpose	What it supports	What it does not prove
Balanced buy/sell evidence	Main evidence	Reveals how models break ties when external signals are symmetrical	That the chosen stock would perform better or worse
Sector, size, and momentum partitions	Main evidence	Identifies where model preferences cluster	That these are the only relevant financial biases
More counter-evidence by volume	Robustness / sensitivity test	Tests whether models revise when opposing evidence becomes the majority	That real market evidence is naturally countable this way
Stronger counter-evidence by intensity	Robustness / sensitivity test	Tests whether models respond to materially stronger opposing claims	That “5% versus 10%” captures real investment conviction
Entropy comparison	Exploratory extension	Connects bias strength with model uncertainty under conflict	That entropy alone is a complete confidence or risk metric

The first reveal: models have investment tastes

The sector results are the first sign that “neutral analysis” is doing a little too much work. Most evaluated models show a positive bias toward technology stocks. DeepSeek-V3, Gemini-2.5-flash, Qwen3-235B, Mistral-Small, and GPT-4.1 all record technology as their highest-scoring sector. Llama4-Scout is the main exception, with energy slightly ahead of technology.

The magnitude varies sharply. Llama4-Scout and DeepSeek-V3 show broadly high positive scores across sectors, which suggests a general buy tendency with sectoral differences inside it. GPT-4.1 is much flatter and turns negative in several sectors, with consumer defensive at the low end. That difference matters. “LLMs are biased” is true but not operationally sufficient. Different models carry different financial reflexes.

The size result is more consistent. Bias scores generally decline as market capitalisation decreases. The largest-cap quantile receives the strongest positive bias across all evaluated models. GPT-4.1 again behaves differently in level: its largest-cap score is mildly positive, while smaller quantiles become negative. But the pattern still points downward as company size falls.

This is not mysterious. The authors attribute the pattern to popularity effects. Large, famous companies and dominant sectors likely appear more often and in richer contexts in training data. The model does not need to “believe” Apple is superior in any human sense. It has simply absorbed a world where large, visible firms are discussed constantly, often with richer narratives and more analyst-style justification. In finance, narrative density can masquerade as conviction.

Momentum bias is tested differently because it is not tied to a fixed company group. The authors create balanced momentum and contrarian arguments, then see which view wins. Across all evaluated models, the contrarian view wins more often than the momentum view, with the difference statistically significant. Gemini-2.5-flash is the least lopsided, but still shows a significant contrarian preference.
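One simple way to check whether a win-rate gap like this is more than noise is an exact two-sided binomial test against a 50/50 null. This is a stand-in for illustration: the article does not say which test the authors used, and the counts in the usage note are invented for the example.

```python
from math import comb


def binomial_two_sided_p(wins: int, n: int) -> float:
    """Exact two-sided p-value for H0: each trial is a fair coin flip.

    Under p = 0.5 the distribution is symmetric, so the p-value is the
    total probability of outcomes at least as far from n/2 as the one seen.
    """
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need 0 <= wins <= n and n > 0")
    distance = abs(wins - n / 2)
    tail = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= distance)
    return tail / 2 ** n
```

If the contrarian argument won 10 of 10 hypothetical trials, `binomial_two_sided_p(10, 10)` returns 2/1024, roughly 0.002. A 5-of-10 split returns 1.0, which is what an indifferent model should look like.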

That finding is especially interesting because it sounds sophisticated. Contrarian reasoning has a respectable investment pedigree. Mean reversion is not nonsense. The danger is that a model can turn a legitimate investment style into a default reflex. If a firm wants a momentum screen, and the AI keeps quietly reframing underperformance as opportunity, the tool is no longer supporting the mandate. It is editing it.

The second reveal: counter-evidence works until the model finds one friendly excuse

The paper becomes more useful when it moves from “models have preferences” to “preferences persist under challenge.”

In the evidence-volume test, the authors present more counter-evidence than supporting evidence. When only counter-evidence is shown, models generally flip at rates near 1.0. That is the optimistic part. These systems are not completely immune to external information. Give them a clean one-sided prompt and they often follow it.

Then comes the important part. Once any supporting evidence is mixed back into the prompt, even when counter-evidence remains the majority, flip rates drop sharply. The model can now choose the evidence that agrees with its prior and treat that as justification. The problem is not ignorance of the counter-evidence. It is selective uptake under conflict.
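A flip rate of this kind can be sketched in a few lines. The definition below (share of trials where the decision under challenge differs from the decision under balanced evidence) is an assumption for illustration; the paper may define the baseline differently.

```python
from typing import Sequence


def flip_rate(baseline: Sequence[str], challenged: Sequence[str]) -> float:
    """Share of paired trials where the challenged decision differs from baseline."""
    if len(baseline) != len(challenged):
        raise ValueError("baseline and challenged must be the same length")
    if not baseline:
        raise ValueError("no trials to compare")
    flips = sum(b != c for b, c in zip(baseline, challenged))
    return flips / len(baseline)


if __name__ == "__main__":
    # Invented decisions: the model leaned buy, then saw counter-evidence.
    before = ["buy", "buy", "buy", "buy"]
    after = ["sell", "buy", "sell", "sell"]
    print(flip_rate(before, after))  # 0.75
```

Computing this separately for "counter-evidence only" and "counter-evidence plus one supportive item" prompts is what exposes the drop described above.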

The effect is strongest in models with stronger initial bias. Llama4-Scout and DeepSeek-V3, which show high sector bias, are less willing to reverse under mixed evidence. GPT-4.1 and Gemini-2.5-flash, which show lower or more balanced bias profiles, remain more adaptable, though not perfectly so. This is the practical link: latent bias is not merely descriptive. It predicts rigidity.

The evidence-intensity test makes the point harder to dismiss. Here, the authors keep the amount of supporting and counter-evidence equal but make the counter-evidence stronger. As the intensity gap rises, flip rates improve. That is sensible. But even when counter-evidence is much stronger, most models remain below a 60% flip rate. Gemini-2.5-flash is the most responsive; GPT-4.1 also improves meaningfully. Qwen3-235B, Mistral-Small, Llama4-Scout, and DeepSeek-V3 remain comparatively stubborn.

For a portfolio workflow, this is where the risk becomes concrete. A model that can be corrected by clean counter-evidence may still fail in the actual operating environment, because investment memos rarely arrive clean. There is almost always one supportive datapoint available somewhere. A stubborn model only needs one.

Confidence can be the prior talking loudly

The entropy analysis adds a useful twist. The authors compare DeepSeek-V3, treated as a higher-bias model, with GPT-4.1, treated as a lower-bias model. Under balanced prompts, GPT-4.1 shows higher entropy, meaning greater uncertainty. DeepSeek-V3 shows lower entropy, meaning greater confidence. At first glance, that looks flattering for DeepSeek-V3. It appears decisive.

But when the prompt becomes imbalanced against the model’s prior, the pattern reverses. DeepSeek-V3’s entropy rises, while GPT-4.1’s falls. The lower-bias model becomes more confident when evidence tilts clearly. The higher-bias model becomes more conflicted when its internal preference is challenged.
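For readers who want to reproduce the comparison on their own models, here is a minimal sketch that estimates entropy from repeated decisions. This is an assumption about method: the paper may compute entropy from token probabilities rather than from sampled outcomes.

```python
import math
from collections import Counter
from typing import Sequence


def decision_entropy(decisions: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the empirical decision distribution."""
    counts = Counter(decisions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no decisions to measure")
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy
```

An even buy/sell split gives 1.0 bit, the maximum for a binary choice, and a unanimous run gives 0.0. The diagnostic is the direction of change: compare the value under balanced prompts with the value under prompts tilted against the model's prior.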

This is an elegant result because it attacks a common user mistake: confusing fluency and confidence with objectivity. In a balanced evidence setting, a confident answer may not mean the model has found a superior investment rationale. It may mean the model’s prior had an easy path through the ambiguity. When stronger opposing evidence appears, that same prior becomes a source of friction.

For builders, this suggests confidence monitoring should not be interpreted in isolation. A low-entropy answer under balanced conflict may be a warning sign, not a comfort blanket. Yes, the model sounds sure. So does every committee member who entered the meeting with a favourite stock.

The business problem is mandate drift, not model personality

It is tempting to describe these findings as if each model has a charming little investment personality. One likes technology. One prefers energy. Several are contrarian. Very cute. Also a terrible governance framework.

The operational issue is mandate drift. If a client asks for a small-cap momentum screen and the model repeatedly favours large-cap contrarian narratives, the system has not merely produced a different opinion. It has moved away from the requested investment process. In regulated or client-facing contexts, that distinction matters.

The paper directly shows that evaluated LLMs can exhibit statistically significant biases across sector, size, and investment style, and that stronger biases can persist when conflicting evidence is introduced. Cognaptus infers the business implication: AI investment systems need mandate-alignment tests that include conflict scenarios. Testing only clean prompts, factual recall, or historical benchmark tasks will miss the failure mode.

A useful audit would include at least four layers:

Audit layer	Question to test	Example failure
Preference map	What does the model favour when evidence is balanced?	Persistent tilt toward technology or large caps
Conflict response	Does the model update when counter-evidence becomes stronger?	Selects one supportive datapoint and ignores the majority
Mandate alignment	Does the model preserve the user’s stated strategy?	Converts momentum brief into contrarian recommendation
Confidence behaviour	Does certainty rise when evidence strengthens, or when priors are comfortable?	Confident under balanced ambiguity, hesitant under clear opposition

This is not only about choosing “the least biased model.” A model with a known bias can sometimes be managed. A model with an unknown bias becomes part of the portfolio construction process without being recognised as such. That is the governance problem wearing a UX costume.

What to change in financial AI evaluation

The study points toward a more serious evaluation stack for LLM-based investment systems.

First, test with balanced conflicts. A model that performs well on one-sided prompts may still behave poorly when evidence is mixed. Balanced prompts are useful because they reveal tie-breaking behaviour. In real workflows, tie-breaking is where hidden priors become decisions.

Second, test against the strategy, not just the answer. If the system is designed to support factor investing, ESG screening, income portfolios, short-term trading, or sector rotation, the model should be evaluated on whether it preserves that framework under ambiguity. The failure mode is not always “bad stock pick.” Sometimes it is “right-sounding answer to the wrong mandate.”

Third, log decision flips. When evidence changes, the model should sometimes change its mind. A system that never revises is not disciplined; it is merely stubborn with better punctuation. Flip-rate diagnostics can help identify whether the model responds to new information or selectively absorbs only favourable updates.

Fourth, separate recommendation generation from bias monitoring. One model can produce analysis while another process audits whether the output systematically favours certain sectors, sizes, geographies, or styles. In financial AI, the model should not be the only witness to its own objectivity. We tried self-regulation in other domains. It was adorable.
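An external monitor does not need to be elaborate. The sketch below compares each sector's share of buy recommendations with its share of the eligible universe; the function name and the sector labels are hypothetical, and a real monitor would also need a tolerance threshold suited to the mandate.

```python
from collections import Counter
from typing import Dict, Sequence


def sector_tilt(
    recommended_sectors: Sequence[str], universe_sectors: Sequence[str]
) -> Dict[str, float]:
    """Recommendation share minus universe share, per sector.

    Positive values mean the model recommends a sector more often than
    its weight in the eligible universe would suggest.
    """
    if not recommended_sectors or not universe_sectors:
        raise ValueError("both inputs must be non-empty")
    rec = Counter(recommended_sectors)
    uni = Counter(universe_sectors)
    n_rec, n_uni = len(recommended_sectors), len(universe_sectors)
    sectors = set(rec) | set(uni)
    return {s: rec[s] / n_rec - uni[s] / n_uni for s in sectors}


if __name__ == "__main__":
    # Invented data: an evenly split universe, a tech-heavy set of picks.
    universe = ["tech", "energy", "tech", "energy"]
    picks = ["tech", "tech", "tech", "energy"]
    print(sector_tilt(picks, universe))
```

The same comparison works for size buckets or style labels, provided the labels come from a source other than the model being audited.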

The boundaries are sharp, and they matter

The paper is useful precisely because it is controlled. That also limits what can be claimed.

The evidence is synthetic. The authors intentionally generate fictional but plausible buy and sell arguments with simplified price-impact claims. This makes the experiment cleaner, but it does not reproduce the full complexity of filings, analyst revisions, macro shocks, liquidity conditions, or actual portfolio constraints.

The stock universe is prominent. The 427 stocks are continuously listed S&P 500 constituents over a five-year window. That helps ensure models are likely to have internal knowledge about them, but it also means the study is not directly testing obscure small caps, non-US equities, private assets, crypto tokens, or emerging-market securities.

The analysis is static. Model versions, training data, alignment methods, and retrieval systems change. A bias map from one model snapshot should not be treated as permanent. It should be treated as something to measure repeatedly.

The design may also be easier for some reasoning models to detect or route around, especially when the imbalance is obvious. The authors note this limitation. In production, prompts are messier and the model may not know it is being tested. That could make behaviour better or worse. Anyone claiming certainty here is selling a dashboard.

So the right interpretation is disciplined: this is not evidence that LLMs cannot support investment analysis. It is evidence that their support needs structured bias testing before it is trusted near strategy, suitability, or client-facing recommendation workflows.

The uncomfortable question: whose view is your AI applying?

The paper’s title says the quiet part out loud: your AI may not be using your view. It may be using its own.

That does not make LLMs useless for finance. It makes them more like junior analysts than calculators. They can process information, organise arguments, and surface considerations quickly. But they can also import assumptions. They can over-favour familiar names. They can resolve ambiguity using learned market narratives. They can sound confident exactly when uncertainty would be healthier.

The lesson for operators is not to ban financial AI from decision workflows. The lesson is to stop pretending the model is a neutral pipe between evidence and recommendation. It is an interpretive layer. Interpretive layers need audits.

A good investment AI stack should therefore ask three questions before deployment: what does this model prefer when the evidence is balanced, how easily does it update when stronger counter-evidence appears, and does its behaviour remain aligned with the mandate it is supposed to serve?

If the answer is unclear, the model is not your portfolio assistant yet. It is a very articulate committee member with undisclosed positions.

Cognaptus: Automate the Present, Incubate the Future.

¹ Hoyoung Lee, Junhyuk Seo, Suhwan Park, Junhyeong Lee, Wonbin Ahn, Chanyeol Choi, Alejandro Lopez-Lira, and Yongjae Lee, “Your AI, Not Your View: The Bias of LLMs in Investment Analysis,” arXiv:2507.20957, 2025. https://arxiv.org/abs/2507.20957
