TL;DR for operators

A new paper by Mateusz Kmak and colleagues asks a very practical question: can Reddit sentiment, even when scored by a Financial-RoBERTa model fine-tuned on ChatGPT-annotated posts, predict meme-stock prices? [1] The short answer is: not very well. Which is awkward, because the whole exercise starts from the obvious temptation that if Reddit can help move a stock, then Reddit sentiment should help forecast it. Markets, naturally, have declined to be that tidy.

The paper studies r/wallstreetbets activity around GameStop and AMC from 4 January to 31 March 2021. It compares several sentiment approaches: TextBlob, an out-of-the-box Financial-RoBERTa model, a ChatGPT-labelled and Reddit-adapted Financial-RoBERTa model, and a simple emoji counter. It then tests these against daily stock prices using Pearson correlation, Kendall Tau, and Granger causality-style predictability tests. The authors also compare sentiment against simpler attention variables: comment volume and Google Trends.

The operational lesson is not “ignore Reddit”. It is more specific. If the business question is price prediction, textual sentiment labels are weak standalone signals in this setup. If the business question is volatility monitoring, crowd mobilisation, retail-flow risk, or narrative surveillance, attention measures look more useful. Comment volume and search interest show stronger links to price dynamics than the sentiment classifiers do.

For alternative-data vendors, this is a product-design warning. A sophisticated social-sentiment model may sound premium, but the higher-value dashboard may be one that tracks attention, acceleration, coordination, emoji density, ticker concentration, and feedback loops between price moves and online activity. Less glamorous, more useful. A cruel arrangement, but finance has always had taste issues.

The boundary is important. This is not a universal proof that Reddit sentiment never predicts stocks. The study covers two meme stocks, one exceptional market episode, one subreddit, daily aggregation, and correlation/Granger tests rather than a validated trading strategy. But within that boundary, the paper is a useful antidote to a very common myth: better NLP does not automatically turn online shouting into alpha.

The real comparison is not Reddit versus Wall Street

The familiar story of the 2021 meme-stock episode is usually told as a collision between retail investors and institutional finance. Reddit discovers collective power. Hedge funds discover pain. Journalists discover the word “stonks”. Everyone discovers that a trading app can become market infrastructure by accident.

That story is not wrong, but it is too broad for this paper. Kmak et al. ask a narrower and more operational question: when a social-media crowd appears to matter, which social signal actually carries useful information?

That distinction matters. Many market-intelligence systems collapse social media into a single variable called “sentiment”, as if the crowd’s emotional valence were the missing ingredient. Positive sentiment means buy pressure. Negative sentiment means sell pressure. Neutral sentiment means the algorithm goes for coffee.

The paper tests whether that assumption holds up when the language is not normal finance language but r/wallstreetbets language: rockets, diamond hands, apes, sarcasm, rallying cries, irony, and emoji-heavy coordination. It then compares the fancy version of the task against the brutally simple version: how many people are talking, and how many people are searching?

That is the more interesting contest:

| Signal type | What it tries to measure | Why it sounds useful | What the paper broadly finds |
| --- | --- | --- | --- |
| Text sentiment | Whether posts sound positive, neutral, or negative toward the stock | It seems closest to investor intention | Weak correlation with price and limited predictive evidence |
| Fine-tuned Reddit sentiment | Whether a model adapted to Reddit slang and emojis can improve on generic sentiment | It should understand the local dialect better | Some isolated signals, but not enough to outperform simpler attention metrics |
| Emoji count | Intensity of non-textual expression | It may capture meme-crowd energy better than words | Mixed: weak or negative for GME, stronger for AMC |
| Comment volume | How much discussion is happening | It measures crowd mobilisation directly | Stronger relationship with price than sentiment labels |
| Google Trends | Wider public attention beyond subreddit discussion | It captures broader awareness and search demand | Strong relationship, especially for AMC |

The table is the article in miniature. The market signal appears less like “the crowd is happy” and more like “the crowd has arrived”.

What the paper actually builds: three sentiment readers and one very rude baseline

The dataset covers roughly seven million r/wallstreetbets posts/comments from early January through the end of March 2021. The focus stocks are GameStop and AMC, the two canonical names from the meme-stock period. The authors also collect Google Trends values and stock price series, then align these with daily social signals.

The modelling stack has four relevant components.

First, TextBlob provides a generic emotional polarity score from -1 to 1. This is the simplest text-sentiment baseline: easy to use, easy to explain, and usually the first thing that gets put into a prototype before everyone agrees to pretend it was only temporary.

Second, the authors use an out-of-the-box Financial-RoBERTa model. This is a stronger finance-specific language model, but its original training domain is formal financial text: filings, earnings documents, CSR reports, ESG news, financial news, and similar material. That is useful for conventional finance language. It is less obviously suited to a Reddit comment that communicates investment conviction through rockets and group identity.

Third, the authors fine-tune Financial-RoBERTa on Reddit-style data labelled by ChatGPT. Reddit posts containing emojis are fed to ChatGPT with instructions to classify sentiment toward the stock as positive, neutral, or negative. Because negative examples are sparse, ChatGPT is also used to generate additional negative samples in the same style. The final training sample contains 10,525 examples: 7,010 positive, 2,219 neutral, and 1,296 negative.
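
The class counts above describe a heavily skewed training set, which is worth seeing as shares rather than raw numbers. The sketch below computes those shares and, purely as an illustration, the inverse-frequency class weights that such an imbalance would imply. The weighting scheme is an assumption for the example; nothing here says the authors used it.

```python
# Class counts as reported for the fine-tuning sample.
counts = {"positive": 7010, "neutral": 2219, "negative": 1296}

total = sum(counts.values())  # 10525, matching the reported sample size

# Share of each class in the training sample.
shares = {label: n / total for label, n in counts.items()}

# Inverse-frequency weights (ASSUMPTION: illustrative, not the paper's method).
# weight = total / (num_classes * class_count); a balanced dataset gives 1.0.
num_classes = len(counts)
class_weights = {label: total / (num_classes * n) for label, n in counts.items()}

for label in counts:
    print(f"{label:>8}: share={shares[label]:.3f}  weight={class_weights[label]:.3f}")
```

Roughly two thirds of the examples are positive, even after the synthetic negative samples are added. A classifier trained on this mix has every incentive to lean bullish, which is worth remembering when reading the sentiment correlations later.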

Fourth, the paper uses an emoji counter. This is not a sentiment model in the usual sense. It just counts emojis. As a baseline, it is almost insulting. As a market signal, it turns out not to be ridiculous. That should make a few alternative-data pitch decks nervous.
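
To show how little machinery such a baseline needs, here is a minimal sketch of a daily emoji counter. The Unicode ranges, the function names `count_emojis` and `daily_emoji_counts`, and the `(date, text)` input shape are all assumptions for illustration; the paper's exact counting rule is not described here.

```python
import re
from collections import Counter

# Approximate emoji ranges (ASSUMPTION: not the paper's exact rule).
# Variation selectors and zero-width joiners are not counted, and
# skin-tone modifiers count as separate symbols.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]"
)


def count_emojis(text: str) -> int:
    """Count individual emoji code points in one comment."""
    return len(EMOJI_PATTERN.findall(text))


def daily_emoji_counts(comments):
    """Sum emoji counts per day from an iterable of (date_string, text)."""
    totals = Counter()
    for day, text in comments:
        totals[day] += count_emojis(text)
    return dict(totals)


# Hypothetical sample comments, not taken from the dataset.
sample = [
    ("2021-01-27", "GME to the moon 🚀🚀🚀"),
    ("2021-01-27", "💎🙌 holding"),
    ("2021-01-28", "no emojis here"),
]
print(daily_emoji_counts(sample))
```

The output is a plain daily series that can be lined up against prices in exactly the same way as comment volume, which is the point: the baseline costs almost nothing to build.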

The authors then run correlation and Granger causality tests. They also apply two adjustments to the price data. A shifted version asks whether the signal aligns with the current day rather than the next day. A stationary version uses daily price changes instead of absolute price levels, which is more appropriate for Granger testing because financial price levels are usually not stationary. These are not side quests. They are sensitivity and implementation choices that determine whether a result looks predictive, contemporaneous, or mostly reactive.
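
The two adjustments are easy to state and easy to get subtly wrong, so a small sketch helps. The code below pairs a signal with prices at a chosen lag and converts price levels into daily changes. The function names and the `lag` convention are assumptions; which alignment the paper labels "shifted" is only loosely described here, so treat `lag=0` and `lag=1` as the two candidates rather than as the authors' definitions.

```python
def align(signal, prices, lag=0):
    """Pair signal[t] with prices[t + lag].

    lag=0 compares the signal with the same day's price,
    lag=1 compares it with the next day's price.
    """
    if lag < 0:
        raise ValueError("lag must be non-negative")
    usable = len(prices) - lag
    return [(signal[t], prices[t + lag]) for t in range(min(len(signal), usable))]


def first_difference(prices):
    """Daily price changes; element i is prices[i + 1] - prices[i]."""
    return [later - earlier for earlier, later in zip(prices, prices[1:])]


# Hypothetical toy values, not taken from the paper.
prices = [10.0, 12.0, 11.0, 15.0]
comments = [5, 8, 6, 9]

same_day = align(comments, prices, lag=0)
next_day = align(comments, prices, lag=1)
changes = first_difference(prices)
print(same_day, next_day, changes)
```

Note that differencing shortens the series by one observation and shifts its indexing by a day, so any signal compared against `changes` has to be trimmed to match. Off-by-one alignment is the cheapest way to manufacture a predictive result by accident.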

GME: the crowd-volume signal is much clearer than the sentiment signal

For GameStop, the attention variables look meaningfully connected to price. The number of comments has a Pearson correlation of 0.522 with stock price in the unshifted setup and 0.470 in the shifted setup. Kendall Tau correlations are also positive, at 0.480 and 0.443. Google Trends is slightly weaker but still substantial: Pearson correlations of 0.427 and 0.434, with Kendall Tau around 0.411 and 0.407.

Those are not magic numbers. They do not mean a trader could simply buy when Reddit gets loud and retire to a tasteful vineyard. But they do indicate that price, comment activity, and search activity moved together during the GME episode.
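
For readers who want to see what the two statistics actually measure, here is a dependency-free sketch of both. It implements Pearson correlation and Kendall's tau-a; the paper's tooling is not specified here, and library implementations such as SciPy's `kendalltau` default to the tau-b variant, which differs only when there are ties. The sample numbers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Linear co-movement: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def kendall_tau(x, y):
    """Tau-a: (concordant - discordant) pairs over all pairs. Ties count as neither."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            direction = (x[i] - x[j]) * (y[i] - y[j])
            if direction > 0:
                concordant += 1
            elif direction < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical daily values: price rises with comment volume, but not linearly.
comment_volume = [100, 250, 900, 400, 300]
price = [20, 40, 150, 90, 60]

print(round(pearson(comment_volume, price), 3))
print(kendall_tau(comment_volume, price))
```

In this toy series the ranking agreement is perfect while the linear fit is merely very good. That gap is why the paper reports both: Kendall Tau checks that the relationship is not just an artefact of a few extreme days pulling a straight line.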

The Granger tests sharpen the interpretation. For GME, comment volume predicts stock price under the stationary setup with p = 0.018, and under the shifted-stationary setup with p = 0.0003. Google Trends predicts stock price under the stationary setup with p = 0.0008, though the shifted-stationary version is weaker at p = 0.0541. The reverse direction also appears in places: stock price predicts comments under stationary treatment with p = 0.0408, and stock price predicts Google Trends under stationary treatment with p = 0.0302.

The sensible reading is not “Reddit caused GME”. The sensible reading is feedback. Price moves create attention; attention may amplify price moves; both become part of the same event loop. The paper’s own use of Granger causality is better read as predictive precedence, not causal proof. The authors note this distinction in their method section, and it matters. Granger causality is not a subpoena from the universe.
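
The "predictive precedence" reading becomes concrete once the test is written out. The sketch below is a stripped-down, single-lag version of the idea: fit tomorrow's value of y on today's y, then add today's x and ask how much the residual error shrinks. It returns only the F statistic, uses one lag, and solves the regressions by hand, all of which are simplifying assumptions. Real analyses would more likely use a library routine such as `grangercausalitytests` in statsmodels, with lag selection and p-values.

```python
import random


def ols_rss(rows, target):
    """Residual sum of squares of a least-squares fit via the normal equations.

    No guard against singular designs; this is a sketch, not a solver.
    """
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, target))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    beta = [aug[i][k] / aug[i][i] for i in range(k)]
    return sum(
        (t - sum(b * v for b, v in zip(beta, r))) ** 2 for r, t in zip(rows, target)
    )


def granger_f_lag1(x, y):
    """F statistic for 'yesterday's x improves the forecast of y beyond yesterday's y'."""
    target = y[1:]
    restricted = [[1.0, y[t]] for t in range(len(y) - 1)]
    unrestricted = [[1.0, y[t], x[t]] for t in range(len(y) - 1)]
    rss_r = ols_rss(restricted, target)
    rss_u = ols_rss(unrestricted, target)
    dof = len(target) - 3  # observations minus unrestricted parameters
    return (rss_r - rss_u) / (rss_u / dof)


# Synthetic data in which x leads y by one step (hypothetical, for illustration).
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [rng.gauss(0, 0.1)] + [0.8 * x[t - 1] + rng.gauss(0, 0.1) for t in range(1, 200)]

print("x -> y:", granger_f_lag1(x, y))
print("y -> x:", granger_f_lag1(y, x))
```

Even in this rigged example, a large F statistic says only that x arrives before y in the data. It says nothing about whether some third variable, such as news coverage, drove both.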

Now compare that with sentiment.

For GME, TextBlob sentiment barely moves with price: Pearson correlations are -0.028 and -0.058. The out-of-the-box Financial-RoBERTa model does a little better in one configuration, with Pearson 0.230 unshifted, but Kendall Tau is weak. The fine-tuned model is also weak: Pearson 0.053 unshifted and -0.124 shifted. The emoji counter is negatively correlated with price, with Pearson values of -0.291 and -0.394.

That negative emoji result is interesting because it resists a simple “more rockets means higher price” story. During a meme event, emoji intensity can rise during stress, defence, panic, loyalty signalling, or post-drop mobilisation. It may reflect emotional heat, not directional optimism. In other words, “diamond hands” may appear precisely when the trade is under pressure. The crowd is not necessarily telling you where the price goes next. It may be telling itself not to sell.

There are some Granger signals involving the fine-tuned model and the emoji counter, but they are not stable enough to rescue the sentiment thesis. For example, GME stock price predicts emoji count at p = 0.0267 in one unadjusted setup, and stock price predicts fine-tuned sentiment at p = 0.0097. The fine-tuned model predicting stock price appears at p = 0.012 in a shifted setup, but this is not a clean, robust next-day forecasting result. It is better treated as a sign of entanglement between price and discourse than as deployable alpha.

The GME comparison is therefore quite stark. Attention behaves like an event detector. Sentiment behaves like a noisy translation layer sitting between the event and the analyst.

AMC: search interest leads the simple signals, while emojis become the anomaly

AMC tells a similar story with a different emphasis. Comment volume is positively correlated with AMC stock price, with Pearson correlations of 0.378 and 0.388. Google Trends is stronger: Pearson correlations of 0.476 and 0.483, and Kendall Tau correlations of 0.552 and 0.537.

That difference is business-relevant. GME’s comment activity is especially prominent, consistent with the idea that r/wallstreetbets discussion was tightly tied to the GME event. AMC, by contrast, appears more strongly associated with broader search interest. This may reflect a wider public-awareness channel rather than only subreddit-native mobilisation. The paper does not establish that mechanism conclusively, but the contrast is useful: not every meme stock has the same attention architecture.

The Granger tests again show stronger signals after stationarity adjustments. AMC stock price predicts comment volume under the stationary setup with p = 0.0007, while comment volume predicts stock price in the shifted-stationary setup with a p-value reported as 0.0000, meaning it rounds to zero at four decimal places. For Google Trends, stock price predicts search interest under stationary treatment with p = 0.0060, while Google Trends predicts stock price in the shifted-stationary setup with a p-value that likewise rounds to 0.0000.

This is the same feedback-loop shape as GME, but with search interest playing a stronger role. A risk system that only watched subreddit comment counts would likely miss part of the AMC signal. A system that combined forum intensity with public search acceleration would be better aligned with the evidence.

The sentiment results again disappoint the sentiment industry, which is a sentence one could reuse often. TextBlob is negatively correlated with AMC price by Pearson, around -0.204 and -0.216, but Kendall Tau is near zero. Out-of-the-box Financial-RoBERTa is weak. The fine-tuned Financial-RoBERTa is also weak, with Pearson 0.026 unshifted and -0.042 shifted.

The emoji counter is the exception. For AMC, it has a Pearson correlation of 0.444 unshifted and 0.309 shifted, with Kendall Tau of 0.398 and 0.306. In Granger testing, emoji count predicts AMC stock price with p = 0.0073 in the unshifted non-stationary setup and p = 0.0092 in a shifted non-stationary setup. The stationary versions, however, are not significant.

That makes the emoji result useful but fragile. It suggests that non-textual intensity may capture something the text classifiers miss. But it also shows why one should not promote “emoji sentiment” into a universal trading signal after one happy table. In AMC, emoji count behaves more like an attention-intensity proxy. In GME, it moves differently. The same crowd grammar can have different market meanings depending on the stock, timing, and phase of the event.

The fine-tuned model improves the language problem, not the market problem

The paper’s most important business lesson sits in the gap between modelling sophistication and predictive value.

Fine-tuning Financial-RoBERTa with ChatGPT-labelled Reddit posts is a reasonable technical move. The base model was trained on formal financial language. Reddit is not formal financial language. The model therefore needs help with slang, emojis, meme conventions, and stock-specific discourse. Using ChatGPT for annotation and augmentation is a practical way to produce labels at scale, especially when human labelling would be slow and expensive.

But the results suggest that the binding constraint is not merely language understanding.

The model may become better at classifying Reddit sentiment. That does not mean the classified sentiment is a strong predictor of price. Those are different claims. One is an NLP performance claim. The other is a market microstructure claim. The paper’s evidence supports scepticism about the second.

This distinction is where many commercial social-sentiment products quietly overreach. They present sentiment extraction as if it were already signal extraction. But a sentiment label is only useful if the market mechanism maps that label to future price movement. During a meme-stock episode, the mechanism may be mobilisation, visibility, reflexivity, liquidity demand, options activity, short-interest narratives, brokerage constraints, media amplification, or identity-driven holding behaviour. A positive/negative label compresses too much of that into one convenient but underpowered variable.

The result is not “NLP is useless”. The result is sharper: NLP may be useful for classifying narratives, detecting themes, separating sarcasm from conviction, and summarising discourse. But as a standalone price predictor, sentiment polarity is probably the wrong abstraction for this episode.

A better abstraction may be crowd state.

Is the crowd assembling? Is it spreading beyond the forum? Are comments accelerating faster than price? Is Google search interest lagging or leading? Are emojis clustering around holding, panic, triumph, or attack narratives? Are users coordinating action or merely reacting to price? Those questions are operationally richer than “is the average post positive?”
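
The lagging-or-leading question in particular can be made mechanical. Below is a minimal sketch of a lead/lag scan that correlates a signal with price at a range of offsets. The function name `lead_lag_scan`, the sign convention, and the toy series are assumptions for illustration; this is not a procedure taken from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns NaN if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return float("nan")
    return cov / (spread_x * spread_y)


def lead_lag_scan(signal, price, max_lag=3):
    """Correlate signal[t] with price[t + lag] for each lag in -max_lag..max_lag.

    Positive lag: the signal moves before price.
    Negative lag: price moves before the signal.
    """
    results = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = signal[: len(signal) - lag], price[lag:]
        else:
            a, b = signal[-lag:], price[: len(price) + lag]
        results[lag] = pearson(a, b)
    return results


# Hypothetical toy series: price repeats the search signal two days later.
search = [0, 1, 0, 0, 3, 0, 0, 2, 0, 0, 1, 0]
price = [0, 0, 0, 1, 0, 0, 3, 0, 0, 2, 0, 0]

scan = lead_lag_scan(search, price)
best_lag = max(scan, key=scan.get)
print(best_lag, scan[best_lag])
```

A scan like this is descriptive, not predictive. It helps separate "search follows price" from "search precedes price" on a dashboard, and it inherits every caveat that applies to the correlations themselves.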

The evidence map: what each test is doing

The paper uses several tests and variants. They should not all be read as equal proof of prediction.

| Evidence component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Pearson correlation | Main evidence for linear co-movement | Attention variables and prices move together more strongly than sentiment labels do | Direction, causality, tradability |
| Kendall Tau | Robust association check based on ranking | The attention relationship is not only a linear artefact | Economic magnitude or forecasting edge |
| Granger causality | Predictive precedence test | Some past values of comments/search/emoji signals help explain later price values in certain setups | True causation or profitable trading strategy |
| Shifted price setup | Timing sensitivity / implementation detail | Whether signals align more with current-day price than next-day price | Clean out-of-sample prediction |
| Stationary price-change setup | Robustness/suitability adjustment for Granger testing | Some relationships appear stronger when testing price changes rather than levels | That all raw price relationships are reliable |
| Fine-tuned Financial-RoBERTa | Model comparison and adaptation test | Reddit-specific adaptation is plausible and technically motivated | That better sentiment classification solves price prediction |
| Emoji counter | Baseline and exploratory extension | Non-textual expression can carry signal, especially for AMC | Universal emoji-based market prediction |

This table is also a useful design guide. If a vendor shows only a sentiment score and a correlation chart, that is not enough. The more important question is whether the signal is stable across timing assumptions, price transformations, stocks, and market regimes. A fragile signal may still be useful for monitoring. It should not be sold as forecast infrastructure.

Business use: build crowd-risk systems, not sentiment-alpha machines

The paper’s direct finding is narrow: for GME and AMC from January to March 2021, sentiment measures from TextBlob, Financial-RoBERTa, and a ChatGPT-annotated fine-tuned Financial-RoBERTa are weakly correlated with stock price and not consistently Granger-predictive. Simpler attention measures, namely comment volume and Google Trends, show stronger relationships. AMC’s emoji counter result adds an intriguing but unstable non-textual signal.

Cognaptus’ business inference is broader but still disciplined: social-media data is probably more valuable as an attention and narrative-risk layer than as a clean price-prediction engine.

For trading platforms, this means social data can help flag crowding risk, unusual retail attention, and possible volatility regimes. The product should not say “Reddit is bullish, buy”. It should say “attention is accelerating, public search is rising, discourse is concentrated, and price is entering a feedback-sensitive regime.” That is a different tool. It is also a less embarrassing one.

For brokerages and exchanges, the relevant use case is operational monitoring. Meme-stock episodes stress customer support, liquidity routing, margin systems, communications, and risk controls. A dashboard that detects fast-rising attention around specific tickers can be useful even if it never predicts the closing price. Prevention beats explaining to regulators that the rocket emojis looked harmless.

For alternative-data vendors, the pricing logic changes. A premium product should not merely add a fine-tuned sentiment model and call it “AI-powered retail intelligence”. The paper suggests the better product bundle is multi-layered:

| Product layer | Useful signal | Operational value |
| --- | --- | --- |
| Attention | Comment volume, search interest, acceleration | Detects crowd arrival and public awareness |
| Intensity | Emoji density, posting bursts, ticker concentration | Captures emotional and mobilisation pressure |
| Narrative | Themes such as short squeeze, hold, sell, betrayal, institutional conflict | Explains why the crowd is active |
| Timing | Lead/lag between price moves, Reddit activity, and search activity | Helps distinguish reaction from possible amplification |
| Risk state | Combined alert score across attention, volatility, and narrative | Supports monitoring, not blind trading |

The ROI case is therefore not “our model predicts meme stocks”. It is “our system detects when a stock is entering a social-feedback regime”. That is more defensible, more useful, and less likely to be disproved by the next p-value.
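
What "detecting a social-feedback regime" might look like at its simplest is a spike detector on an attention series. The sketch below flags days on which comment volume jumps far above its own recent history and also computes acceleration as a second difference. The window length, the threshold, and the function names are hypothetical choices for illustration, not values drawn from the paper.

```python
from statistics import mean, pstdev


def acceleration(series):
    """Second difference: the change in the day-over-day change."""
    velocity = [later - earlier for earlier, later in zip(series, series[1:])]
    return [later - earlier for earlier, later in zip(velocity, velocity[1:])]


def zscore_alerts(series, window=5, threshold=3.0):
    """Return indices whose value exceeds the trailing mean by more than
    `threshold` trailing standard deviations.

    Only the `window` values before each day are used, so no future data leaks in.
    """
    alerts = []
    for i in range(window, len(series)):
        history = series[i - window : i]
        spread = pstdev(history)
        if spread == 0:
            continue  # flat history: no meaningful z-score
        if (series[i] - mean(history)) / spread > threshold:
            alerts.append(i)
    return alerts


# Hypothetical daily comment counts with a sudden surge on day index 6.
daily_comments = [100, 110, 105, 95, 102, 98, 900, 950]

print(zscore_alerts(daily_comments))
print(acceleration(daily_comments))
```

Note the behaviour on the second surge day: once the first spike enters the trailing window, the baseline inflates and the next high reading no longer looks unusual. That is a design choice worth making deliberately, since a monitoring tool has to decide whether it is flagging arrivals or sustained regimes.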

Why sentiment polarity is a poor container for meme-stock behaviour

Sentiment analysis assumes that positive and negative language are meaningful directional summaries. For consumer reviews, that often works. “This laptop is excellent” and “this laptop exploded” are not hard to interpret, at least until someone launches a ruggedised laptop brand called Exploded.

Meme-stock discourse is different. A negative-sounding post can be bullish if it attacks short sellers. A panicked post can be part of a hold-the-line ritual. A rocket emoji can express conviction, irony, group belonging, or desperate hope. “Diamond hands” may be strongest when the price is falling, precisely because the community needs to reinforce holding behaviour. The same token can encode both optimism and stress.

That means sentiment polarity may erase the mechanism. The market-relevant question may not be whether users sound positive. It may be whether users are coordinating around a shared action, reinforcing identity, resisting sell pressure, or attracting outside attention.

This is where the paper’s fine-tuning result is so useful. Even after adapting the model to Reddit language, sentiment remains weak. That implies the issue is not only that generic models misunderstand slang. The deeper issue is that bullishness, mobilisation, and price impact are not the same variable.

For business users, this is the difference between language analytics and behavioural analytics. Language analytics asks what the text says. Behavioural analytics asks what the crowd is doing. Meme stocks punish anyone who confuses the two.

The boundary conditions are not footnotes; they define the product claim

This study should not be stretched beyond its evidence.

First, the sample is narrow. Two stocks, one subreddit, and a three-month window around an extraordinary retail-trading episode are not a general theory of social-media finance. The findings may not apply to large-cap earnings reactions, crypto assets, small-cap pump campaigns, or post-2021 market behaviour.

Second, the data is daily. Intraday meme-stock dynamics can move quickly. A signal that is weak at daily frequency might matter at hourly or minute-level resolution, and a signal that appears predictive at daily frequency might disappear when execution constraints are considered.

Third, Granger causality is not causal identification. It tests whether past values of one series help predict another series, conditional on the setup. It does not establish that Reddit activity caused price changes, nor does it isolate confounders such as news coverage, options activity, short interest, brokerage restrictions, or broader market conditions.

Fourth, the paper does not present a validated trading strategy with transaction costs, slippage, risk controls, and out-of-sample testing. That matters because a statistically interesting signal can still be commercially useless if it is late, crowded, unstable, or too expensive to trade.

Fifth, ChatGPT-labelled training data inherits annotation risk. The model can label slang efficiently, but automatic labels are not the same as ground truth. More importantly, even perfect sentiment labels would not solve the problem if sentiment itself is the wrong target.

These are not criticisms that weaken the article’s main point. They sharpen it. The correct product claim is not “Reddit sentiment predicts stocks”. The correct claim is “some social attention variables are useful indicators of meme-stock feedback regimes, while sentiment polarity alone is weak evidence for price prediction.”

Less exciting, yes. Also less wrong.

The practical takeaway: measure mobilisation before mood

The paper’s best insight is not that sentiment fails. It is that a more expensive sentiment pipeline can lose to simpler measures of attention.

That should change how teams build market-intelligence systems. Start with crowd mobilisation: comment volume, search interest, acceleration, concentration, and timing. Add NLP where it explains the narrative structure behind the mobilisation. Use sentiment as one feature, not as the headline. Treat emojis as intensity markers whose meaning must be calibrated by stock, community, and phase of the event.

For GME, comment activity and Google Trends moved much more clearly with price than sentiment labels did. For AMC, Google Trends and the emoji counter were more informative than the text classifiers. Across both cases, the fine-tuned Reddit sentiment model did not transform online language into dependable price prediction.

That is the useful disappointment. The crowd may matter, but not because its average sentence can be labelled positive or negative. It matters because attention concentrates, spreads, loops back into price, and changes the behaviour of market participants.

The market was not reading Reddit like a sentiment analyst. It was reacting to mobilisation. Anyone building tools for this space should do the same.

Cognaptus: Automate the Present, Incubate the Future.


  1. Mateusz Kmak, Kamil Chmurzyński, Kamil Matejuk, Paweł Kotzbach, and Jan Kocoń, “Predicting stock prices with ChatGPT-annotated Reddit sentiment: Hype or reality?”, arXiv:2507.22922, 2025. https://arxiv.org/pdf/2507.22922