Signals & Sentiments: How GPT-2 and FinBERT Beat Buy-and-Hold on the S&P 500

TL;DR for operators

A recent arXiv paper tests whether financial-news sentiment from GPT-2 and FinBERT can improve S&P 500 trading when combined with technical indicators and time-series models.¹ The strongest reported strategy, GPT-2 sentiment on Dow Jones news combined with VW MACD, returns 5.77% over the May 10-August 7, 2024 test period. The buy-and-hold benchmark returns -0.696% over the same window.

That is an interesting result. It is not a license to print money, despite what the more excitable corners of finance Twitter would do with a number like 5.77%.

The paper’s more useful finding is architectural: sentiment alone is unstable, technical indicators alone classify poorly, and time-series models are respectable but not enough. The edge appears when noisy but different signals are fused: news tone, price momentum, volume-weighted confirmation, and short-horizon forecasts.

For business use, the practical implication is not “replace portfolio managers with GPT-2,” a sentence that should already make compliance departments reach for coffee. The implication is that trading and wealth platforms can build better decision-support layers: alerts, risk overlays, market-regime diagnostics, and source-aware signal engines.

The uncertainty is material. The backtest is short, assumes zero transaction costs and slippage, treats news items equally, and does not prove robustness across regimes, assets, execution venues, or live deployment. The right takeaway is therefore disciplined: language-model sentiment may be a useful ingredient in systematic trading infrastructure, but the recipe still needs adult supervision.

The 5.77% result is real, but it needs translation

Trading strategies usually become dangerous at exactly the moment they sound simple.

The headline result in the paper is simple enough: a hybrid strategy using GPT-2 sentiment from Dow Jones news and VW MACD returned 5.77%, while buying and holding the S&P 500 over the same period returned -0.696%. On the surface, that looks like a clean win for language-model sentiment.

But the paper is more interesting when read less like a trophy cabinet and more like an evidence stack. The authors do not merely ask whether GPT-2 or FinBERT can classify financial news. They build a pipeline that aligns S&P 500 prices with financial news, converts returns into directional labels, generates daily sentiment signals, calculates technical indicators, adds time-series forecasts, and then simulates trading against a buy-and-hold baseline.

That means the 5.77% result is not a pure LLM result. It is a hybrid-system result.

The language model supplies one kind of information: the tone of financial news. VW MACD supplies another: volume-weighted momentum confirmation. ARIMA, Prophet, and ETS supply a third: short-horizon statistical structure from historical prices. The trading simulation then turns those inputs into buy, sell, or hold decisions.

That distinction matters. If a firm reads the paper as “GPT-2 beat the market,” it will build the wrong product. If it reads the paper as “source-specific sentiment can improve a signal stack when paired with market microstructure proxies and forecast filters,” it may build something useful.

Less catchy, yes. Also less likely to embarrass everyone in an investment committee meeting.

What the paper actually builds

The study uses daily S&P 500 data from August 8, 2019 to August 7, 2024, then evaluates trading performance over the May 10-August 7, 2024 backtest window. The financial news data comes from five sources: Dow Jones, Benzinga, Barron, MarketWatch, and The Wall Street Journal.

The modelling design has four layers.

| Layer | What the paper uses | What it contributes | Practical interpretation |
| --- | --- | --- | --- |
| Market data | S&P 500 open, high, low, close, adjusted close, volume | Historical price and volume context | The endogenous market state |
| Sentiment models | GPT-2 and FinBERT | Daily positive, neutral, or negative news tone | The exogenous narrative signal |
| Technical indicators | MACD, SAR, VW MACD, Dual MACD | Momentum and reversal filters | A market-confirmation layer |
| Time-series models | ARIMA, Prophet, ETS | Forecasted price/return direction | A statistical smoothing layer |

The paper also applies a timing rule to news: articles after market close are linked to the next trading day, and weekend articles are paired with Monday’s return. That is not glamorous, but it is essential. A sentiment signal is only useful if it is aligned to a decision point. Otherwise, the model is just forecasting yesterday with impressive posture.
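The timing rule is easy to state and easy to get wrong, so a minimal sketch helps. The code below assumes timestamps are already in exchange-local time, a 4:00 pm close, and no holiday calendar; `MARKET_CLOSE`, `next_weekday`, and `assign_trading_day` are illustrative names, not the authors' code.

```python
from datetime import datetime, date, time, timedelta

MARKET_CLOSE = time(16, 0)  # assumption: 4:00 pm exchange-local close


def next_weekday(d: date) -> date:
    """Roll a date forward until it lands on Monday-Friday (holidays ignored)."""
    while d.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        d += timedelta(days=1)
    return d


def assign_trading_day(published: datetime) -> date:
    """Map an article timestamp to the trading day its sentiment is paired with."""
    day = published.date()
    if day.weekday() < 5 and published.time() > MARKET_CLOSE:
        day += timedelta(days=1)  # after the close: belongs to the next session
    return next_weekday(day)  # weekend dates roll forward to Monday
```

Under these assumptions, an article published on Friday evening and one published on Saturday morning both land on Monday, while a Tuesday-morning article stays on Tuesday. A production version would swap the weekday check for a real exchange calendar.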

For sentiment aggregation, multiple article-level predictions on the same day are collapsed into a daily sentiment label using a voting mechanism. This makes the system tradable at daily frequency, but it also simplifies the information structure. A major Federal Reserve story and a routine market wrap can receive equal treatment if both enter the daily vote. That choice becomes important later when we discuss deployment.
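A plain majority vote is one way such a daily label could be produced. The sketch below is an assumption about the mechanics, not the paper's implementation: the paper says only that a voting mechanism is used, and the tie-breaking rule here (fall back to neutral) is invented for illustration.

```python
from collections import Counter


def daily_vote(labels):
    """Collapse article-level labels into one daily label by majority vote.

    Ties and empty days fall back to "neutral" (an illustrative assumption).
    """
    if not labels:
        return "neutral"
    counts = Counter(labels)
    top = max(counts.values())
    winners = [label for label, n in counts.items() if n == top]
    return winners[0] if len(winners) == 1 else "neutral"
```

Every article counts once, so `daily_vote(["negative", "positive", "positive"])` returns `"positive"` even if the single negative item is the Federal Reserve story and the two positives are routine wraps.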

The trading simulation starts with **$10,000** and no initial shares, then dynamically adjusts holdings based on combined signals. The authors explicitly assume **zero transaction costs and slippage**, which makes the experiment cleaner for comparing strategies but less realistic for live execution.
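A frictionless simulation of this kind fits in a few lines. The sketch below assumes an all-in/all-out rule (buy with all cash, sell all shares), which is a simplification chosen for illustration; the paper describes its allocation only as dynamic, and the function names are hypothetical.

```python
def simulate(prices, signals, cash=10_000.0):
    """Run a zero-cost, zero-slippage simulation and return the final value.

    Assumption: "buy" invests all cash, "sell" liquidates all shares,
    and anything else leaves the position unchanged.
    """
    shares = 0.0
    for price, signal in zip(prices, signals):
        if signal == "buy" and cash > 0:
            shares += cash / price
            cash = 0.0
        elif signal == "sell" and shares > 0:
            cash += shares * price
            shares = 0.0
    return cash + shares * prices[-1]  # mark any open position at the last price


def total_return(final_value, start_value=10_000.0):
    """Simple return over the whole window."""
    return final_value / start_value - 1
```

On the paper's reported figures, a $10,000 start ends near $10,577 for the best hybrid strategy (5.77%) and about $9,930 for buy-and-hold (-0.696%).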

The evidence stack is not one test pretending to be five

The paper’s empirical pieces have different roles. Reading them as if every table proves the same thing would blur the actual argument.

| Evidence item | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Sentiment classification by news source | Main evidence for model-source fit | FinBERT and GPT-2 perform differently depending on outlet | That the highest classifier always produces the best trading return |
| Time-series model accuracy | Main comparison layer | Prophet and ETS reach 59.65% classification accuracy; ARIMA is weaker at 29.82% | That forecast accuracy alone maximises portfolio return |
| Technical indicator classification | Diagnostic / ablation-like evidence | Standalone MACD, SAR, VW MACD, and Dual MACD classify poorly, all below 10% | That technical indicators are useless inside a combined strategy |
| Hybrid trading returns | Main performance evidence | Several combined strategies outperform buy-and-hold in the short test window | That the strategy is robust after costs, slippage, regime shifts, or live execution |
| GPT-2 versus FinBERT combinations | Sensitivity / model comparison | The best model depends on source and companion indicator | That GPT-2 is universally better than FinBERT |

This is the right way to read the paper: as a study of **signal interaction**, not as a single leaderboard.

The interesting result is not simply that one line in one table is higher than another. It is that the weak components are not weak in the same way. Sentiment is source-dependent. Technical indicators are poor standalone classifiers but useful as filters. Time-series models are smoother and more stable but do not dominate trading performance. The hybrid system works when those imperfections offset each other rather than compound.

FinBERT wins some classification contests; GPT-2 wins the best trade

A reasonable reader might expect FinBERT to dominate. It is trained for financial language; GPT-2 is a general language model. In a clean story, domain specialization wins, everyone nods, and the article ends early.

The paper refuses to be that tidy.

FinBERT reaches its best sentiment classification accuracy on Benzinga at **75.56%** and performs strongly on Dow Jones at **67.69%**. GPT-2, meanwhile, performs best on The Wall Street Journal at **65.48%** and Barron at **64.56%**. MarketWatch is weak for both models, with FinBERT at **32.22%** and GPT-2 at **34.44%**.

So far, that looks like a source-quality and model-fit story. But trading returns complicate it.

The best reported hybrid return is not from the highest sentiment classifier. It is from **GPT-2 on Dow Jones with VW MACD**, returning **5.77%**. FinBERT’s best reported combination is **Benzinga with Dual MACD**, returning **4.64%**.

That gap is the paper’s quiet lesson: sentiment classification accuracy and trading profitability are related, but not identical.

A classifier can be accurate in an average sense while still failing to identify the moments that matter most for trading. Conversely, a model with lower aggregate classification accuracy may produce more useful signals around turning points, volatility clusters, or news events that coincide with tradable price movement.

For operators, this means model evaluation cannot stop at sentiment-label accuracy. A trading desk does not monetise F1 scores. It monetises better entry, exit, sizing, hedging, and risk avoidance. The validation metric must match the business objective.
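A toy example makes the gap between accuracy and profitability concrete. The returns and predictions below are invented for illustration and are not drawn from the paper; the long-or-flat rule is also an assumption.

```python
def accuracy(predictions, realized):
    """Share of days where the predicted direction matches the realized sign."""
    hits = sum((p == "up") == (r > 0) for p, r in zip(predictions, realized))
    return hits / len(realized)


def long_flat_return(predictions, realized):
    """Compound return of a rule that is long on "up" days and flat otherwise."""
    value = 1.0
    for p, r in zip(predictions, realized):
        if p == "up":
            value *= 1 + r
    return value - 1


# Hypothetical week: four quiet up days, then one sharp sell-off.
realized = [0.001, 0.001, 0.001, 0.001, -0.05]
always_up = ["up", "up", "up", "up", "up"]            # right on 4 of 5 days
catches_drop = ["down", "down", "up", "up", "down"]   # right on 3 of 5 days

print(accuracy(always_up, realized), long_flat_return(always_up, realized))
print(accuracy(catches_drop, realized), long_flat_return(catches_drop, realized))
```

The first model scores higher on accuracy and loses money; the second scores lower and ends slightly positive, because the one day it gets right is the day that matters.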

Technical indicators look terrible alone, then useful in company

One of the paper’s sharper findings is almost comically unflattering to standalone technical indicators.

MACD and SAR each achieve **7.02%** classification accuracy. VW MACD reaches **5.26%**. Dual MACD reaches **3.51%**. As standalone directional classifiers in this dataset, these are not exactly covering themselves in quantitative glory.

Yet VW MACD and Dual MACD appear in the strongest trading combinations.

That is not necessarily contradictory. A technical indicator can be poor as a full predictor and still useful as a conditional filter. Its job in the hybrid strategy is not to explain the market by itself. Its job is to confirm, veto, or modulate another signal.

VW MACD is especially relevant because it incorporates volume. If news sentiment turns positive while price momentum is supported by volume, the signal is more credible than sentiment floating alone in narrative space. Markets do not pay investors for having opinions; they pay for being positioned when flows move.

This is where the paper becomes practically useful. The system is not saying:

“The chart knows the truth.”

It is closer to saying:

“The chart may help decide when the news is actionable.”

That is a more useful claim. Also less mystical, which is always welcome in finance.
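The confirmation idea can be sketched as follows. This assumes one common formulation of VW MACD (the difference between a fast and a slow volume-weighted exponential average of the close, with the conventional 12/26/9 spans) and a simple agree-or-hold fusion rule. The paper's exact construction, parameters, and combination logic are not reproduced here, and all function names are illustrative.

```python
def ema(values, span):
    """Exponential moving average seeded with the first observation."""
    alpha = 2 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out


def vw_macd(close, volume, fast=12, slow=26, signal=9):
    """Return (macd_line, signal_line) from volume-weighted EMAs.

    Assumes strictly positive volumes so the weighted average is defined.
    """
    pv = [c * v for c, v in zip(close, volume)]

    def vw_ema(span):
        return [n / d for n, d in zip(ema(pv, span), ema(volume, span))]

    line = [f - s for f, s in zip(vw_ema(fast), vw_ema(slow))]
    return line, ema(line, signal)


def fuse(sentiment, macd_value, signal_value):
    """Act only when news tone and volume-weighted momentum agree."""
    bullish = macd_value > signal_value
    if sentiment == "positive" and bullish:
        return "buy"
    if sentiment == "negative" and not bullish:
        return "sell"
    return "hold"  # disagreement: the indicator vetoes the news
```

In this design the indicator never originates a trade on its own. It only confirms or vetoes the sentiment signal, which is how a weak standalone classifier can still be a useful filter.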

Time-series models are stable, but stability is not the same as alpha

The time-series results are more respectable than the technical-indicator results. Prophet and ETS both achieve **59.65%** classification accuracy, while ARIMA reaches **29.82%**. The figures show predicted returns clustering more tightly around zero than actual returns, with the models offering smoother estimates during volatile periods.

This matters because time-series models can act as stabilisers. They reduce the temptation to overreact to every piece of news. In a hybrid architecture, they can help separate ordinary market noise from directional evidence.

But the trading tables show that forecast accuracy alone does not necessarily produce the best returns. The most profitable strategies arise from combinations of sentiment and technical confirmation rather than pure time-series forecasting.

That is an important business point. Many financial AI products fail because they optimise an intermediate metric and then act surprised when the portfolio result is underwhelming. A clean forecast can still be economically weak if it arrives too late, triggers too often, misses convex payoffs, or fails to avoid drawdowns.

The paper’s best result is therefore not a victory of forecasting over trading. It is a reminder that trading systems need **decision metrics**, not just prediction metrics.

What the best result probably means

The best-performing setup combines three ideas:

**Dow Jones sentiment from GPT-2** supplies a daily narrative signal.
**VW MACD** checks whether price movement is supported by volume-weighted momentum.
**Dynamic allocation** allows the strategy to avoid the sharp early-August decline that hurt buy-and-hold.

The mechanism is plausible. News can shift expectations. Price and volume can reveal whether those expectations are being acted upon. A trading rule can then reduce exposure when the combined signal deteriorates.

The paper’s Figure 2, as described by the authors, shows the buy-and-hold benchmark becoming more volatile and suffering a sharp early-August decline, while selected hybrid models maintain steadier upward trends. This is important because the outperformance may come less from heroic upside capture and more from **loss avoidance** during a weak segment of the window.

For business users, that distinction changes the product framing.

A system that avoids bad exposure is not necessarily an alpha engine. It may be a risk-control layer. That is still valuable. In many institutional settings, a tool that reduces drawdown, flags sentiment-momentum divergence, or warns that a benchmark is entering an unstable regime is easier to justify than a fully automated trading bot.

The former assists judgement. The latter asks the legal team to believe in your backtest. One of these conversations is shorter.

Where the paper is strongest

The strongest part of the paper is the empirical contrast across signal types. It does not merely show that one model works. It shows that the performance depends on source, model, indicator, and fusion logic.

That makes the result more business-relevant than a generic “LLMs for finance” demonstration. In practice, financial organisations already have fragmented signal systems: news feeds, analyst notes, charting tools, factor models, risk systems, alerts, execution dashboards. The question is not whether one more model can be added. The question is whether it can be integrated without producing decorative noise.

This paper suggests a useful integration pattern:

| Design principle | Evidence in the paper | Business meaning |
| --- | --- | --- |
| Do not evaluate sentiment in isolation | Sentiment classification varies strongly by news source | Vendor, source, and corpus selection are part of model design |
| Do not discard weak standalone indicators too quickly | Technical indicators classify poorly alone but help in top hybrid strategies | A weak predictor can still be a useful filter |
| Optimise for trading outcome, not only prediction accuracy | Highest sentiment accuracy does not map directly to highest return | Backtests need business-aligned metrics |
| Treat source-model pairs as products | GPT-2 and FinBERT perform differently across outlets | Sentiment engines should be source-calibrated |
| Use hybrid signals for risk control | Strong models avoided benchmark volatility in the test period | Decision support may be more realistic than autonomy |

This is where operators should pay attention. The paper is not a final system architecture, but it points toward one: sentiment ingestion, source weighting, model-specific calibration, technical confirmation, forecast smoothing, and portfolio-aware evaluation.

What Cognaptus would infer for business use

The paper directly shows that, in one short S&P 500 backtest, some hybrid sentiment-plus-market strategies outperformed buy-and-hold. It also directly shows that GPT-2 and FinBERT perform differently across news sources, and that technical indicators alone were weak classifiers in the tested setup.

Cognaptus would infer three business pathways from that evidence.

First, **trading desks can use LLM sentiment as a signal overlay**, not as a standalone execution brain. The useful interface is likely an alerting layer: “news sentiment has turned negative while volume-weighted momentum is weakening,” or “Dow Jones sentiment is positive but price-volume confirmation is absent.” That is more actionable than a chatbot saying “the market feels bullish,” which is the kind of sentence that should be escorted out of the building.

Second, **fintech platforms can package source-aware sentiment into portfolio tools**. Retail and adviser platforms already show price charts, analyst ratings, and news feeds. A calibrated sentiment layer could summarise whether trusted sources are becoming more positive or negative, whether that change is confirmed by market data, and whether the portfolio is exposed to the affected index or sector.

Third, **wealth managers can use hybrid signals for risk conversations**. The value is not necessarily in telling clients when to trade the S&P 500. It may be in explaining why the firm is reducing tactical exposure, delaying rebalancing, or increasing hedges. A sentiment-momentum framework can turn vague market anxiety into auditable decision support.

The common thread is governance. The paper’s method is most credible when used to improve human decision workflows, not to remove them.

The boundaries are not footnotes; they change the interpretation

The limitations are not generic academic modesty. They materially change how the result should be used.

The backtest covers **May 10 to August 7, 2024**. That is a short window. A strategy that performs well over three months may be detecting a regime, exploiting a specific drawdown pattern, or benefiting from parameter luck. Without longer out-of-sample tests, it is impossible to know.

The simulation assumes **zero transaction costs and slippage**. The paper acknowledges this. For a daily S&P 500 index strategy, costs may be manageable depending on instrument and turnover, but they are not imaginary. Execution quality matters most when signals flip frequently or when live markets are stressed.

The sentiment aggregation treats news items equally. That is convenient, but not realistic. A short breaking-news item from a high-impact source may matter more than a longer low-impact article. Author credibility, article placement, source bias, duplication, and timing all matter. Equal weighting is a starting assumption, not an operating model.
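One way to relax equal weighting is a source-weighted vote, sketched below. The weights are hypothetical placeholders, not values estimated in the paper; in practice they would have to be learned and validated out of sample.

```python
from collections import defaultdict


def weighted_vote(items, source_weights, default_weight=1.0):
    """Pick the daily label with the largest total source weight.

    items: iterable of (source, label) pairs for one trading day.
    source_weights: hypothetical per-source weights; unknown sources
    get default_weight. Empty days return "neutral"; exact ties go to
    the label seen first.
    """
    totals = defaultdict(float)
    for source, label in items:
        totals[label] += source_weights.get(source, default_weight)
    if not totals:
        return "neutral"
    return max(totals, key=totals.get)


day = [
    ("Dow Jones", "negative"),
    ("MarketWatch", "positive"),
    ("MarketWatch", "positive"),
]
print(weighted_vote(day, {}))                                       # equal weights
print(weighted_vote(day, {"Dow Jones": 3.0, "MarketWatch": 1.0}))   # illustrative weights
```

With equal weights the two MarketWatch items win the day; with the illustrative weights the single Dow Jones item outvotes them.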

The paper focuses only on the **S&P 500 index**. Index-level sentiment is different from single-stock sentiment. Company-level trading introduces earnings calendars, idiosyncratic shocks, sector exposure, liquidity differences, and corporate-event risk. Crypto, commodities, and FX would require different calibration again.

Finally, the study is a backtest, not a live trading deployment. Backtests are useful. They are also famously polite. They do not complain about latency, data revisions, missing feeds, compliance logs, market impact, broker outages, or the small matter of being wrong with real money.

What would make this production-grade

A production version of this approach would need four upgrades.

The first is **longer and harsher validation**. The model should be tested across rallies, drawdowns, inflation shocks, earnings-heavy periods, quiet markets, and crisis windows. It should also be tested on rolling windows to see whether the best model-source-indicator combination remains stable or rotates.

The second is **execution-aware backtesting**. Transaction costs, slippage, bid-ask spreads, turnover, instrument choice, and latency should be included. A strategy that survives this step becomes more interesting. A strategy that disappears was not a strategy; it was a spreadsheet having a good day.

The third is **source weighting**. The paper already shows that news source matters. A real system should not treat all outlets, articles, authors, or timestamps equally. It should learn which sources are predictive for which assets, horizons, and regimes.

The fourth is **portfolio integration**. A signal should be evaluated by what it does to drawdown, volatility, turnover, concentration, exposure, and risk-adjusted return. Raw return is not enough. A 5.77% short-window gain is attractive, but production allocators will ask how it behaves when the world stops cooperating.
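Two of these checks are simple enough to sketch. The code below assumes a proportional cost charged on each trade and measures maximum drawdown on an equity curve; the trade count and cost level in the example are hypothetical, since the paper reports neither.

```python
def net_return(gross_return, n_trades, cost_per_trade):
    """Apply a proportional cost on each trade to a gross window return."""
    return (1 + gross_return) * (1 - cost_per_trade) ** n_trades - 1


def max_drawdown(equity):
    """Largest peak-to-trough decline of an equity curve, as a negative fraction."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = min(worst, value / peak - 1)
    return worst


# Hypothetical: 20 trades at 5 basis points each on the reported 5.77% gross return.
print(net_return(0.0577, n_trades=20, cost_per_trade=0.0005))

# Hypothetical equity curve: the drawdown runs from the 110 peak to the 99 trough.
print(max_drawdown([100, 110, 99, 105]))
```

Under those invented assumptions, costs remove roughly one percentage point of the headline return. The point is not the specific number but that turnover belongs in the evaluation alongside return.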

The real lesson is fusion discipline

The paper’s result is easy to overstate and too useful to ignore.

It does not prove that GPT-2 and FinBERT can reliably beat the S&P 500. It does not prove that news sentiment is a permanent source of alpha. It does not prove that a three-month backtest should become an autonomous trading product.

It does show something more grounded: financial news sentiment can carry tradable information when aligned carefully with market data, filtered through technical confirmation, and evaluated against a portfolio outcome. It also shows that model choice, source choice, and indicator pairing matter. There is no universal “best LLM sentiment model” hiding in the table. There are combinations that work better under specific assumptions.

That is the operator’s takeaway.

Markets do not reward elegant components. They reward systems that combine imperfect information better than the next participant. In this paper, the winning system listens to the news, checks the chart, watches the forecast, and avoids staying long when the evidence turns hostile.

Not magic. Not market omniscience. Just a slightly more disciplined machine for deciding when words are worth trading on.

That is enough to be interesting.

Notes

**Cognaptus: Automate the Present, Incubate the Future.**

¹ Haojie Liu, Zihan Lin, and Randall R. Rojas, “Enhancing Trading Performance Through Sentiment Analysis with Large Language Models: Evidence from the S&P 500,” arXiv:2507.09739, 2025.
