Markets have a talent for embarrassing elegant models.

A factor model says a company looks cheap, profitable, revised upward, less volatile, or attractively positioned. A news headline says the company just changed guidance, delayed a merger, won a contract, received a regulatory opinion, or did something else that refuses to fit politely into a spreadsheet. The obvious modern temptation is to feed both into a large language model, add some attention, and let the machine discover alpha. Naturally, because this is finance, the obvious temptation is not quite correct.

The paper behind this article, "Exploring the Synergy of Quantitative Factors and Newsflow Representations from Large Language Models for Stock Return Prediction," studies exactly that uncomfortable boundary between numerical factors and textual newsflow.[1] It asks a practical question rather than a theatrical one: when stock-selection models already have traditional quantitative factors, does LLM-encoded news actually add predictive value, and how should the two signals be combined?

The answer is useful because it is not a slogan. LLM news representations can help. They can also dilute factor information, underperform simpler models, and become less useful when fine-tuned in the wrong market context. The winning idea is not “use more AI.” It is “separate what is stable from what is opportunistic, then let the model decide how much of each to trust.”

A small mercy. We take those where we find them.

The real problem is signal dilution, not text understanding

Most AI-for-investing discussions start with whether the model can “understand” financial text. That is the wrong first question here.

The paper assumes that news can be turned into representations using an encoder-only LLM, DeBERTa. The encoder receives stock-specific newsflow from a look-back window and aggregates token representations into a news vector. The prediction model then combines this vector with roughly 200 quantitative factors covering categories such as quality, size, volume, growth, momentum, risk, revisions, and valuation.
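The aggregation step can be pictured with a minimal sketch. It assumes mean pooling over non-padding tokens and a plain average across the headlines in the window; the article does not specify the paper's pooling rule, so both choices are illustrative. Random arrays stand in for DeBERTa outputs, and the hidden size of 8 and the function names are hypothetical.

```python
import numpy as np

def pool_headline(token_vecs, mask):
    """Mean-pool the token vectors of one headline, ignoring padded positions."""
    mask = mask.astype(float)[:, None]             # shape (tokens, 1)
    return (token_vecs * mask).sum(axis=0) / max(mask.sum(), 1.0)

def news_vector(headlines):
    """Average the pooled headline vectors over the look-back window."""
    pooled = [pool_headline(t, m) for t, m in headlines]
    return np.mean(pooled, axis=0)

rng = np.random.default_rng(0)
hidden = 8  # hypothetical; real encoders are much wider

# Two headlines as (token_vectors, attention_mask); 1 = real token, 0 = padding.
headlines = [
    (rng.normal(size=(5, hidden)), np.array([1, 1, 1, 0, 0])),
    (rng.normal(size=(5, hidden)), np.array([1, 1, 1, 1, 1])),
]
vec = news_vector(headlines)
print(vec.shape)
```

Whatever the pooling rule, the output is one fixed-length vector per stock and date, which is what lets it sit next to a factor vector in the steps that follow.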

The more important problem is what happens after the text is represented.

Quantitative factors are not just arbitrary numbers. They are compressed claims about companies: profitability, valuation, momentum, analyst revision, balance-sheet quality, risk, and related characteristics. News headlines often overlap with those claims. An earnings headline may be partly captured by growth and revision factors. A downgrade may be partly captured by momentum or analyst revision data. A buyback headline may overlap with valuation and capital allocation signals.

So the first mechanism is simple:

News is useful only when it adds information not already captured by the factors.

That is different from saying news is useful when it is semantically rich. A headline can be perfectly meaningful and still useless for stock selection if the same information has already flowed into prices, analyst estimates, or factor inputs. This is where many LLM-finance demos quietly lose the plot. They confuse linguistic relevance with investment relevance.

The paper’s model comparison is built around this distinction. It tests whether LLM-derived newsflow representations improve return prediction when fused with classic factor data, and whether more sophisticated fusion machinery actually helps.

Spoiler, because finance rarely rewards decoration: the simple fusion method often performs better than the fancier ones.

Three ways to fuse factors and news, and one reason simplicity wins

The paper tests three representation-level fusion methods.

The first is representation combination. It concatenates factor features and LLM-generated news representations, then passes them through a dense layer. This is the least glamorous method. It preserves the two modalities as separate inputs before learning interactions across them.

The second is representation summation. It projects both modalities into a shared representation space and sums them. This assumes the two signals can be meaningfully aligned into a common vector form.

The third is attentive representation. It adds modality-level weights, allowing the model to adjust the relative importance of factors and news before producing a unified representation.
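The three designs are easier to compare side by side. The sketch below is a simplified reading of the descriptions above, not the paper's architecture: layer sizes, the `tanh` activation, the scoring vector `w_att`, and all function names are assumptions, and random weights stand in for trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, news_dim, hidden = 200, 16, 32   # news_dim and hidden are hypothetical

def dense(x, W, b):
    """One dense layer with a tanh activation (activation is an assumption)."""
    return np.tanh(x @ W + b)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Randomly initialised parameters stand in for trained weights.
W_cat = rng.normal(scale=0.1, size=(n_factors + news_dim, hidden))
W_f = rng.normal(scale=0.1, size=(n_factors, hidden))
W_n = rng.normal(scale=0.1, size=(news_dim, hidden))
w_att = rng.normal(scale=0.1, size=hidden)
b = np.zeros(hidden)

def fuse_combination(f, n):
    """Concatenate the raw modalities, then apply one dense layer."""
    return dense(np.concatenate([f, n]), W_cat, b)

def fuse_summation(f, n):
    """Project each modality into a shared space and add the results."""
    return dense(f, W_f, b) + dense(n, W_n, b)

def fuse_attention(f, n):
    """Project, score each modality, and take the weighted sum."""
    hf, hn = dense(f, W_f, b), dense(n, W_n, b)
    a = softmax(np.array([hf @ w_att, hn @ w_att]))   # modality-level weights
    return a[0] * hf + a[1] * hn

f = rng.normal(size=n_factors)   # one stock's factor vector
n = rng.normal(size=news_dim)    # one stock's news vector
for fuse in (fuse_combination, fuse_summation, fuse_attention):
    print(fuse.__name__, fuse(f, n).shape)
```

Note the structural difference: only `fuse_combination` lets the dense layer see every raw factor and every news dimension at once. The other two reduce each modality to a shared-space vector before the modalities meet.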

At first glance, attention sounds like the natural winner. It can weight modalities. It looks sophisticated. It has the correct aesthetic for a slide deck.

The results do not obey the slide deck.

In the North American universe, Fusion Combination achieves a 32.43% annualized return in the long-only portfolio and 28.41% in the long-short portfolio. Fusion Summation posts 20.42% and 16.13%. Fusion Attention posts 21.73% and 19.59%. In the Emerging Markets universe, Fusion Combination and Fusion Summation are similar in long-only performance, but Fusion Combination remains stronger than Fusion Attention in long-short performance. In the European universe, Fusion Combination also leads among the fusion methods, with 19.60% annualized long-only return and 32.51% long-short return.

The mechanism suggested by the paper is that summation and attention may compress heterogeneous information too aggressively. Factors and news are not merely two views of the same object. They have different structures, timing, noise profiles, and degrees of redundancy. Forcing them into a common representational space can obscure useful differences.

That interpretation matters operationally. In investment systems, the question is not whether multimodal AI is possible. It is whether the fusion layer preserves the useful structure of each signal long enough for the model to learn from it.

| Design choice | What it tries to do | What the evidence suggests | Business reading |
| --- | --- | --- | --- |
| Concatenate factors and news representations | Preserve both modalities and learn interactions | Generally strongest fusion method across tested universes | A simple integration layer may beat expensive architectural cleverness |
| Sum projected representations | Force both modalities into a shared space | Often weaker portfolio performance | Alignment can become information compression |
| Add attention weights | Adaptively weight modalities before prediction | Does not consistently outperform simpler fusion | Attention is not a substitute for useful signal separation |

This is the first practical lesson: if the data streams have different economic meaning, do not rush to make them look alike. Some information is useful precisely because it has not been homogenised yet.

The mixture model exists because fusion can damage the factor signal

Fusion Combination performs well, but the paper does not stop there. It notices a deeper failure mode.

In some cases, fusion improves the long-only portfolio but weakens the long-short portfolio. That is not a small technical detail. Long-only performance mainly asks whether the model can identify attractive stocks. Long-short performance asks whether it can identify both attractive and unattractive stocks. A model that buys well but shorts badly is only half a stock-selection model, though admittedly the half people prefer to talk about.

The paper’s North American illustrative comparison shows this tension. Fusion can add value when news provides complementary information, but it can also dilute factor information when the news is redundant, irrelevant, or already priced. This dilution is especially damaging when the model needs to identify the bottom decile for shorting.

So the authors introduce a mixture model. Instead of producing only one fused representation and one prediction, the system keeps two prediction routes:

  1. a factor-based component, trained on quantitative factors alone;
  2. a fusion-based component, trained on the combined factor-news representation.

A learned probability mechanism then decides how much weight to assign to each prediction for each instance.

Conceptually, this is closer to an investment committee than a blender. One analyst uses the factor model. Another analyst uses factors plus news. A gatekeeper decides whose forecast deserves more influence for this stock at this time. Less poetic, more useful.

The model’s prediction can be understood as:

$$ \hat{r} = p_f \hat{r}_f + p_u \hat{r}_u $$

where $\hat{r}_f$ is the factor-based prediction, $\hat{r}_u$ is the fusion-based prediction, and $p_f$ and $p_u$ are adaptive weights assigned to each component.

The idea is attractive because it allows news to matter only when it earns its keep. That is the correct standard. Markets are not obliged to reward a headline just because an embedding model found it linguistically interesting.
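The weighted sum above is simple enough to write down directly. This is a minimal sketch that assumes the two weights come from a softmax over gate logits, so they sum to one for each stock; the forecasts and logits are made-up numbers, and `mixture_prediction` is a hypothetical name.

```python
import numpy as np

def mixture_prediction(r_f, r_u, gate_logits):
    """Blend factor-based and fusion-based forecasts with per-stock weights."""
    z = gate_logits - gate_logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    p = e / e.sum(axis=1, keepdims=True)      # each row sums to 1: [p_f, p_u]
    return p[:, 0] * r_f + p[:, 1] * r_u, p

r_f = np.array([0.02, -0.01, 0.00])     # factor-only forecasts (made-up)
r_u = np.array([0.05, -0.03, 0.01])     # fusion forecasts (made-up)
logits = np.array([[0.0, 0.0],          # equal trust in both routes
                   [5.0, 0.0],          # mostly the factor route
                   [0.0, 5.0]])         # mostly the fusion route
r_hat, p = mixture_prediction(r_f, r_u, logits)
print(np.round(p, 3))
print(np.round(r_hat, 4))
```

Because the weights are set per stock, the blended forecast always lies between the two component forecasts, and it collapses to the factor forecast whenever the gate gives news no weight.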

Conventional mixture training breaks the components it is supposed to combine

There is a catch. A conventional end-to-end mixture model can train unstably.

The paper’s diagnosis is technical but important. When the mixture model is trained conventionally, the prediction components and the probability weighting mechanism are updated together through the same loss. The gradient for each component becomes entangled with the mixture probabilities and the residuals produced by all components. If one component is inaccurate, it can inflate the residual used to train the others. If the mixture probabilities fluctuate, they add more variance to the parameter updates.

In plain terms: the model tries to learn the experts and the judge at the same time, and the judge keeps changing the rules while the experts are still learning to speak.

The paper formalises this through a proposition on entangled gradient variance. Under conventional training, the variance of the stochastic gradient for a component contains terms involving both the component’s own gradient signal and the variance of its mixture probability. Under standalone training, that additional probability-entanglement term is absent.
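A short derivation sketch shows where the entanglement comes from. It assumes a squared-error loss on the mixture prediction and a gate whose parameters are separate from the component's. The symbols $\mathcal{L}$ and $\theta_f$ (the factor component's parameters) are introduced here for illustration; the paper's own proposition is stated in terms of gradient variance rather than this simplified form. Under conventional joint training, the gradient for the factor component is:

$$ \mathcal{L} = \tfrac{1}{2}\left(r - p_f \hat{r}_f - p_u \hat{r}_u\right)^2, \qquad \frac{\partial \mathcal{L}}{\partial \theta_f} = -\,p_f\left(r - p_f \hat{r}_f - p_u \hat{r}_u\right)\frac{\partial \hat{r}_f}{\partial \theta_f} $$

Under standalone training, the same component sees only its own residual:

$$ \mathcal{L}_f = \tfrac{1}{2}\left(r - \hat{r}_f\right)^2, \qquad \frac{\partial \mathcal{L}_f}{\partial \theta_f} = -\left(r - \hat{r}_f\right)\frac{\partial \hat{r}_f}{\partial \theta_f} $$

In the first expression the update is scaled by $p_f$ and driven by a residual that contains $\hat{r}_u$, so noise in the gate or in the other component flows into the factor component's updates. In the second, neither term appears.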

This is not merely decorative theory. The appendix extends the training-error comparison across North American, Emerging Markets, and European universes. The curves show that conventional mixture training produces slower and less stable convergence of the prediction components, while decoupled training yields more stable component learning.

The fix is decoupled training.

The model trains the factor and fusion components independently so each can realise its own predictive capacity. Separately, it trains the weighting distribution to match which component actually performs better on each training instance. The target distribution assigns higher probability to the component with lower prediction error, scaled through a temperature parameter. At test time, since the true return is unknown, the trained probability model infers which component should receive more weight from the observed factor and news inputs.

This is the paper’s most important mechanism. The mixture model is not valuable simply because it combines two predictions. It is valuable because the training method avoids damaging the components before combining them.
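The target distribution for the weighting model can be sketched as follows. The article says only that the lower-error component gets higher probability, scaled by a temperature; the softmax over negative absolute errors, the temperature value, and the name `gate_targets` are assumptions made for illustration.

```python
import numpy as np

def gate_targets(r_true, r_f, r_u, temperature=0.01):
    """Soft labels favouring whichever component had the smaller error."""
    errors = np.stack([np.abs(r_true - r_f), np.abs(r_true - r_u)], axis=1)
    z = -errors / temperature
    z -= z.max(axis=1, keepdims=True)              # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)        # columns: [p_f, p_u]

# Forecasts from components that were already trained on their own (made-up).
r_true = np.array([0.04, -0.02, 0.01])
r_f = np.array([0.01, -0.02, 0.03])    # factor component
r_u = np.array([0.05, -0.06, 0.03])    # fusion component

targets = gate_targets(r_true, r_f, r_u)
print(np.round(targets, 3))
```

In this toy case the fusion route is favoured for the first stock, the factor route for the second, and the third is a tie. A gate network would then be fit to such targets using only the factor and news inputs, for example with a cross-entropy loss; that loss choice is also an assumption. A lower temperature pushes the targets towards a hard choice, a higher one towards an even split.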

| Evidence item | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Fusion method comparison | Main evidence | Simple representation combination is often stronger than summation or attention | That concatenation is universally optimal |
| Conventional vs decoupled mixture training | Main evidence plus ablation-like mechanism test | Joint mixture training can destabilise components; decoupling improves robustness | That this exact decoupling method is the only solution |
| Gradient variance proposition | Theoretical support | Instability has a plausible optimisation mechanism | That empirical performance is fully explained by gradient variance |
| Decile-return charts | Diagnostic evidence | Long-short performance depends on correct ranking at both top and bottom deciles | That annualized returns will survive costs and capacity constraints |
| Fine-tuning comparison | Sensitivity test | LLM fine-tuning is market-dependent, not automatically beneficial | That fine-tuning should generally be avoided |

This table matters because not every result in the paper has the same evidentiary role. The portfolio tables are the main evidence. The training curves explain why the mixture model needs special handling. The fine-tuning experiments are sensitivity tests, not a second thesis. The qualitative news examples are interpretive diagnostics, not proof that the model “understands” markets like a portfolio manager. Thankfully.

The portfolio results say “adaptive layer,” not “LLM takeover”

The clearest empirical pattern is that different universes reward different combinations of factors and news.

In the North American universe, Fusion Combination performs best in the long-only portfolio, with 32.43% annualized return and a Sharpe ratio of 1.00. Mixture Decoupled is lower on long-only return at 28.21%, but it performs best in the long-short portfolio, with 33.77% annualized return and a Sharpe ratio of 1.78. The interpretation is that fusion helps identify strong long candidates, but the decoupled mixture is better at managing the bottom-decile short leg.

In the Emerging Markets universe, Factors Alone is already powerful: 17.14% long-only return and 42.17% long-short return. News Alone performs badly, with negative long-only and long-short returns. Fusion Combination improves over News Alone but lags Factors Alone. Mixture Decoupled is more balanced: it reaches 18.50% long-only return, the best in that table, and 42.07% long-short return, almost matching Factors Alone.

In the European universe, Fusion Combination leads long-only and long-short returns among the methods shown: 19.60% and 32.51%. Mixture Decoupled is close behind, with 18.32% and 30.43%, while posting the highest IC at 0.053.

The important pattern is not that one architecture wins every cell. It does not. The important pattern is that the decoupled mixture is relatively robust across universes and portfolio types. It does not always produce the top number, but it often avoids the worst consequence of naive fusion: losing the stable factor signal when news does not add much.

That is exactly what a business system should want from newsflow AI. Not a heroic model that claims to discover hidden truth in every headline. A disciplined layer that can say, in effect, “this text helps here, but not there.”

Low prediction error is not the same as good stock selection

One of the paper’s most useful observations is that prediction metrics and portfolio outcomes do not always agree.

The authors report Mean Absolute Percentage Error (MAPE) and Information Coefficient (IC). MAPE measures the average absolute gap between predicted and realised returns, expressed relative to the size of the realised return. IC measures rank correlation between predicted and actual returns. For stock selection, ranking is often more important than point accuracy, because the portfolio is built by sorting stocks and selecting top and bottom deciles.

The paper gives a clear example. In the North American results, Fusion Attention has a relatively low MAPE of 1.302, better than Fusion Combination’s 1.402. Yet Fusion Attention performs worse as a portfolio model. Fusion Combination has a higher IC of 0.031 and much stronger long-only and long-short returns.

The mechanism is straightforward. A stock-selection model does not need to forecast every return perfectly. It needs to rank the extremes correctly. A small error near the top-decile boundary can matter more than a larger error in the middle of the distribution. MAPE is polite to the wrong mistakes.
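The disagreement between the two metrics is easy to reproduce on a toy example. The five returns and both sets of forecasts below are invented for illustration and have nothing to do with the paper's data. IC is computed here as a Spearman rank correlation, and the helper assumes there are no tied values.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute error relative to the size of the realised value."""
    return np.mean(np.abs((y_true - y_pred) / y_true))

def rank_ic(y_true, y_pred):
    """Spearman rank correlation; assumes no tied values."""
    def rank(x):
        return np.argsort(np.argsort(x))
    return np.corrcoef(rank(y_true), rank(y_pred))[0, 1]

y = np.array([0.10, 0.08, 0.02, -0.08, -0.10])           # realised returns

# Model A: close in magnitude, but swaps the two best and the two worst stocks.
pred_a = np.array([0.08, 0.10, 0.02, -0.10, -0.08])
# Model B: badly overscaled, but the ordering is exactly right.
pred_b = np.array([0.30, 0.20, 0.05, -0.20, -0.30])

for name, pred in (("A", pred_a), ("B", pred_b)):
    print(name, "MAPE:", round(mape(y, pred), 3), "IC:", round(rank_ic(y, pred), 3))
```

Model A wins on MAPE and loses on IC. A portfolio built by sorting on the forecasts only sees the ordering, so it would prefer Model B despite its much larger point errors.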

This is a practical warning for investment AI procurement. A vendor can show attractive prediction-error metrics while failing at the actual portfolio task. The business test is not “does the model predict returns with low average error?” It is “does the model improve the rank ordering that drives the portfolio construction rule?”

For a quant platform, this means model evaluation should follow the decision pipeline:

  1. generate predictions;
  2. rank the universe;
  3. build top-decile or long-short portfolios;
  4. inspect decile spreads;
  5. compare against factor-only and news-only baselines;
  6. test sensitivity to market universe and fine-tuning;
  7. only then discuss deployment.

Doing this in the opposite order is how one gets a beautiful model governance document attached to a mediocre strategy. A familiar genre.
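Steps 2 to 5 of that pipeline can be sketched for a single rebalancing date. This is a simplified illustration on synthetic data: the universe size, the noise levels, and the function names are hypothetical, portfolios are equal-weighted within each decile, and a random forecast stands in for a proper factor-only or news-only baseline.

```python
import numpy as np

def decile_labels(pred):
    """Assign deciles 0 (lowest prediction) to 9 (highest prediction)."""
    ranks = np.argsort(np.argsort(pred))
    return (ranks * 10) // len(pred)

def decile_report(pred, realized):
    """Equal-weighted realised return per decile, plus portfolio summaries."""
    labels = decile_labels(pred)
    decile_mean = np.array([realized[labels == d].mean() for d in range(10)])
    return {
        "decile_mean": decile_mean,
        "long_only": decile_mean[9],                     # top decile
        "long_short": decile_mean[9] - decile_mean[0],   # top minus bottom
    }

rng = np.random.default_rng(0)
n = 500  # hypothetical universe size
signal = rng.normal(size=n)
realized = 0.02 * signal + rng.normal(scale=0.05, size=n)   # synthetic returns
pred = signal + rng.normal(scale=1.0, size=n)               # noisy forecast

report = decile_report(pred, realized)
baseline = decile_report(rng.normal(size=n), realized)      # uninformative forecast
print(np.round(report["decile_mean"], 4))
print("long-short:", report["long_short"], "baseline:", baseline["long_short"])
```

Looking at all ten decile means, rather than only the top one, is what exposes a model that picks longs well but cannot separate the bottom of the ranking.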

Fine-tuning the LLM is not free alpha with extra steps

The paper also tests whether enabling LoRA fine-tuning of DeBERTa improves performance. The result is inconvenient: sometimes yes, sometimes no.

In North America, fine-tuning helps the News Alone model substantially, raising its long-only annualized return from 20.96% to 27.98% and its long-short return from 1.08% to 14.23%. But for several multimodal methods, fine-tuning weakens performance. Fusion Combination drops from 32.43% to 28.61% in long-only return. Mixture Decoupled drops from 33.77% to 30.66% in long-short return, although its Sharpe ratio remains strong.

In Emerging Markets, fine-tuning helps some methods. News Alone moves from negative long-short performance to positive. Mixture Decoupled improves slightly in long-short return from 42.07% to 43.49%. Fusion Summation also improves. In Europe, fine-tuning has mixed but generally modest effects: Fusion Combination’s long-only return rises from 19.60% to 20.15%, while its long-short return is nearly unchanged.

The authors interpret this through market efficiency and data characteristics. In a highly efficient market, public news may be quickly incorporated into prices and reflected in factors. Fine-tuning the LLM may then overemphasise noisy or already-priced information. In more heterogeneous markets, the news may contain more unpriced or unevenly processed information, making fine-tuning more useful.

The business inference is not “never fine-tune.” It is worse for anyone hoping for a rule of thumb: fine-tuning should be treated as a market-specific sensitivity setting.

For production teams, that implies a control framework:

| Deployment question | Practical test |
| --- | --- |
| Does news add incremental signal beyond factors? | Compare factor-only, news-only, fusion, and mixture models by universe |
| Does fine-tuning help or overfit? | Run side-by-side LoRA and frozen-LLM tests on the same backtest window |
| Is the model ranking correctly? | Prioritise IC, decile spreads, and portfolio outcomes over average prediction error |
| Is the short leg working? | Inspect 0th-decile returns separately from 9th-decile returns |
| Is the result stable across markets? | Test NA, EM, EU, or equivalent regional universes separately |
| Is news redundant? | Audit headline categories around rebalancing dates where fusion underperforms |

The last point is especially valuable. The paper’s qualitative appendix examples show that news around earnings, ratings, sales, and guidance often overlaps with factor information. News about partnerships, trial results, acquisitions, regulatory changes, production updates, or contracts may provide more complementary information. This is not a universal taxonomy, but it is a useful diagnostic habit.

The question is not “is the news positive?” The question is “is the news new relative to the rest of the model?”

The business value is signal governance, not headline magic

For quant funds, wealthtech platforms, and investment-data vendors, this paper has a practical pathway.

Start with vendor-supplied company headlines. Convert them into LLM representations. Combine those representations with structured factors. Test simple fusion first. Then test a decoupled mixture that protects the factor model from noisy or redundant news. Evaluate by rank-based metrics and portfolio backtests, not just prediction error. Deploy only in universes where news adds measurable incremental value.

That pathway is less glamorous than “LLMs reinvent investing.” It is also more plausible.

The direct business implications are threefold.

First, LLM newsflow should be an adaptive signal layer, not a replacement for factor infrastructure. Factors remain powerful, especially where news is sparse, noisy, redundant, or quickly priced. In the Emerging Markets table, Factors Alone is extremely competitive, and News Alone is poor without fine-tuning. Throwing away factors because text feels more intelligent would be an expensive form of theatre.

Second, architecture cost should be justified by incremental portfolio value. The paper does not show that more complex fusion is better. It shows that a relatively simple concatenation-based method can outperform summation and attention. The implication for product teams is clear: do not begin with the most impressive neural architecture. Begin with the cleanest baseline that preserves economic structure.

Third, model governance should include modality relevance monitoring. A multimodal investment model should be monitored for when news helps, when it hurts, and when it becomes redundant. That means tracking not only aggregate performance but also decile behaviour, IC, universe-level differences, headline categories, and the mixture weights assigned to factor versus fusion components.

The commercial opportunity is therefore not simply to sell “LLM alpha.” It is to sell better signal triage: knowing when narrative data deserves capital allocation weight and when it should be politely ignored.

The boundaries are narrow, and that is useful

The paper’s evidence is meaningful, but its boundaries matter.

The experiments use commercial company-level headline data, not full articles. The authors explicitly note that full article content is noisier and computationally heavier, likely requiring relevance filtering. News look-back windows differ by universe: one week for North America and Europe, one month for Emerging Markets because of lower news coverage.

The LLM is DeBERTa, an encoder-only model. Fine-tuning uses LoRA with rank 4 across linear layers. Larger decoder-only models are not the focus, partly because of potential memorisation concerns when evaluating post-2022 market data. The training and validation data span 2003 to 2022, while testing covers 2023 and 2024. Models are evaluated over the test period without rolling retraining.
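For readers unfamiliar with LoRA, the rank-4 setting mentioned above can be illustrated on a single linear layer. This is a generic sketch of the low-rank update, not the paper's training code: only `rank = 4` comes from the article, while the 768-wide layer, the scaling constant `alpha`, and the initialisation scales are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 768, 768, 4, 8   # only rank=4 comes from the article

W = rng.normal(scale=0.02, size=(d_out, d_in))   # frozen pretrained weight
A = rng.normal(scale=0.01, size=(rank, d_in))    # trainable down-projection
B = np.zeros((d_out, rank))                      # trainable up-projection, zero at start

def lora_linear(x):
    """Frozen projection plus a scaled low-rank correction."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d_in))
out = lora_linear(x)

frozen_params = W.size
trainable_params = A.size + B.size
print(out.shape, trainable_params, frozen_params)
print("trainable share:", trainable_params / frozen_params)
```

With `B` initialised to zero, the layer starts out identical to the frozen one, and fine-tuning moves only the small `A` and `B` matrices. That is why the frozen-versus-LoRA comparison in the paper is a relatively cheap sensitivity test to run.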

The portfolios are equal-weighted, monthly rebalanced, and based on top and bottom predicted deciles. The paper reports annualized returns, Sharpe ratios, MAPE, IC, cumulative returns, and decile returns. It does not establish live tradable alpha after transaction costs, financing costs, borrow availability, shorting constraints, market impact, capacity limits, tax effects, compliance restrictions, or the inevitable decay that begins once too many people read the same paper.
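The headline numbers can be tied back to monthly portfolio returns with two short formulas. The sketch below uses common conventions, geometric annualisation and a Sharpe ratio scaled by the square root of 12 with a zero risk-free rate, which may differ from the paper's exact definitions. The monthly return series are made up.

```python
import numpy as np

def annualized_return(monthly):
    """Geometric annualisation of monthly simple returns."""
    growth = np.prod(1.0 + monthly)
    return growth ** (12.0 / len(monthly)) - 1.0

def annualized_sharpe(monthly, rf_monthly=0.0):
    """Mean excess return over its sample standard deviation, scaled by sqrt(12)."""
    excess = monthly - rf_monthly
    return np.sqrt(12.0) * excess.mean() / excess.std(ddof=1)

# Made-up monthly returns for the top-decile and bottom-decile portfolios.
long_leg = np.array([0.03, -0.01, 0.02, 0.04, -0.02, 0.01] * 4)
short_leg = np.array([0.01, -0.02, 0.00, 0.02, -0.03, 0.01] * 4)
long_short = long_leg - short_leg    # gross of costs, borrow fees, and financing

print("long-only :", annualized_return(long_leg), annualized_sharpe(long_leg))
print("long-short:", annualized_return(long_short), annualized_sharpe(long_short))
```

Nothing in these formulas subtracts trading costs or borrow fees, which is exactly the gap between a reported backtest figure and tradable alpha.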

These boundaries do not weaken the paper. They prevent the wrong reading of it.

The right reading is not that LLMs can now predict stocks. The right reading is that LLM-derived news representations can be useful when integrated with factor models carefully, evaluated through portfolio-relevant metrics, and protected from over-weighting noisy text.

That is a subtler conclusion. Also, inconveniently, a more valuable one.

A better mental model for LLMs in quant investing

The paper reframes LLMs in quant investing from “forecasting oracle” to “conditional information processor.”

The LLM does not replace the investment model. It transforms newsflow into representations. The fusion model tests whether those representations add signal to structured factors. The mixture model learns when the fused signal deserves weight. The decoupled training scheme prevents the adaptive mechanism from damaging the specialist predictors. The backtest then checks whether the resulting rankings actually support stock selection.

That chain is the important contribution.

It also gives investment organisations a more disciplined way to experiment with AI. Instead of asking whether LLMs “work” in investing, ask narrower questions:

Does text add incremental information beyond factors in this universe?

Does the architecture preserve factor strength when text is weak?

Does the model rank extremes correctly, or merely reduce average forecast error?

Does fine-tuning specialise the news representation or overfit already-priced noise?

Does the short leg improve, or is the model only good at finding longs?

Those questions are less exciting than a keynote. Fortunately, portfolios are not marked to keynote enthusiasm.

The broader lesson is that numbers and narratives do not need to be forced into one grand representation. Sometimes the better design is to let each speak separately, then learn when to listen.

Cognaptus: Automate the Present, Incubate the Future.


  1. Tian Guo and Emmanuel Hauptmann, “Exploring the Synergy of Quantitative Factors and Newsflow Representations from Large Language Models for Stock Return Prediction,” arXiv:2510.15691, 2025, https://arxiv.org/pdf/2510.15691