Sound and Fury Signifying Stock Picks

TL;DR for operators

Finfluencer videos are not just “text with a face attached.” They contain ticker symbols on charts, spoken recommendations, gestures, confidence, hedging, hype, and the occasional performance of certainty. VideoConviction turns that mess into a benchmark: 288 YouTube videos from finance influencers, 687 stock recommendation segments, 6,063 expert annotations, transcripts, metadata, and a 1–3 conviction score grounded in tone, facial expression, delivery, and consistency between title and content.¹

The useful operational lesson is not “AI can now watch finance YouTube and trade for you.” Please do not build that product and call it wisdom. The paper shows something narrower and more valuable: multimodal models can use video to improve ticker extraction, especially when charts display symbols, but they still struggle to identify the recommendation action and the speaker’s conviction. The benchmark’s hardest task—extracting ticker, action, and conviction together—stays below 30 F1 even for the best models in the table.

The portfolio analysis is equally sobering. A plain buy-and-hold strategy based on finfluencer recommendations trails QQQ and SPY over the paper’s 2018–2024 backtest. High-conviction recommendations do better than low-conviction ones, but not enough to beat the stronger passive benchmark. The inverse finfluencer strategy has the highest annual return in the headline table, but it comes with lower risk-adjusted performance than the index funds and collapses in the non-penny-stock sensitivity test. In other words: charisma may be a signal, but it is not a clean asset-pricing factor. Annoying, but markets often are.

For businesses, the near-term value is monitoring and diagnosis: segment videos, extract tickers and actions, estimate conviction, flag persuasive but weak recommendations, and support compliance or investor education. The boundary is important. This paper supports better social-finance intelligence. It does not validate fully automated trading from influencer content, nor does it prove that confidence causes returns.

The evidence starts with a familiar problem: a loud video, a vague trade

Anyone who has watched retail-finance YouTube knows the format. A thumbnail screams that a stock is about to explode. The title suggests urgency. The speaker opens with market colour, then mentions three tickers, walks through a chart, adds a sponsor segment, says one stock is “interesting,” says another is “a buy at the right price,” and ends by reminding viewers that this is not financial advice. Elegant, like a smoke alarm giving portfolio allocation guidance.

For a human, the ambiguity is irritating. For an AI system, it is the task.

The central contribution of VideoConviction is that it treats this ambiguity as data rather than noise. The researchers do not merely ask whether a model can transcribe a video. They ask whether a model can extract the structure of a recommendation:

Task	What the model must extract	Why it gets harder
T	Ticker	Often visible in charts or spoken directly
TA	Ticker + action	Requires distinguishing commentary from advice
TAC	Ticker + action + conviction	Requires interpreting intent, delivery, and strength of belief

That progression is the right way to read the paper. The easy part is naming the object. The hard part is determining what the speaker is actually telling viewers to do, and how forcefully they are doing it.
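
To make the progression concrete, here is a minimal sketch of how the same predictions can score very differently at each task level. It assumes set-based F1 with exact matching on the fields each task requires; the benchmark's actual matching rules may differ, and the `Recommendation` class, the action labels, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    ticker: str
    action: str      # e.g. "buy", "sell", "hold" (label set is an assumption)
    conviction: int  # 1-3, mirroring the benchmark's conviction score


# Fields that must match exactly at each task level.
TASK_FIELDS = {
    "T": ("ticker",),
    "TA": ("ticker", "action"),
    "TAC": ("ticker", "action", "conviction"),
}


def project(recs, task):
    """Reduce each recommendation to the fields the task cares about."""
    return {tuple(getattr(r, f) for f in TASK_FIELDS[task]) for r in recs}


def f1(gold, pred, task):
    """Set-based F1 on a 0-100 scale (an assumed scoring rule)."""
    g, p = project(gold, task), project(pred, task)
    true_positives = len(g & p)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(p)
    recall = true_positives / len(g)
    return 100 * 2 * precision * recall / (precision + recall)


# Hypothetical annotations and model output for one video.
gold = [Recommendation("AAPL", "buy", 3), Recommendation("TSLA", "sell", 2)]
pred = [Recommendation("AAPL", "buy", 2), Recommendation("TSLA", "hold", 2)]

scores = {task: f1(gold, pred, task) for task in TASK_FIELDS}
for task, score in scores.items():
    print(f"{task}: {score:.1f}")
```

In this toy case the model names both tickers correctly, gets one action wrong, and misjudges conviction on the other, so the score falls at each step even though nothing about the ticker extraction changed.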

This is also where the paper becomes business-relevant. Most practical systems do not fail because they cannot recognise “AAPL.” They fail because they cannot reliably distinguish “Apple is worth watching” from “buy Apple now,” or “I am cautiously constructive” from “this is a must-buy.” Recognising the ticker is entity extraction. Separating commentary from a call to action is recommendation interpretation. Gauging how hard the speaker is pushing is behavioural persuasion analysis. Those are not the same job, however much vendor demos enjoy pretending otherwise.

Video helps models read tickers, not judgement

The benchmark results are clean enough to be uncomfortable. Multimodal inputs improve ticker extraction. That makes intuitive sense: stock videos often show tickers directly in charts, slides, watchlists, or screenshots. If the ticker is on the screen, a video-capable model has another route to the answer. It can look, not just listen.

On segmented video inputs, top proprietary multimodal models reach the mid-80s F1 range on ticker extraction. Gemini 1.5 Flash reaches 86.23 F1 on T, Gemini 2.0 Pro reaches 86.04, Gemini 1.5 Pro reaches 86.01, and GPT-4o reaches 83.47. Text-only models are weaker on this exact task, though still respectable; Qwen2.5-72B reaches 79.20 and DeepSeek-V3 reaches 77.56 on segmented transcripts.

Then the task asks for action.

Performance drops. On segmented inputs, the best TA scores sit in the low-to-mid 50s: Gemini 2.0 Pro reaches 54.28, Gemini 1.5 Flash reaches 54.21, Gemini 1.5 Pro reaches 53.51, and DeepSeek-V3 reaches 51.35. That is not catastrophic, but it is no longer the confident competence suggested by ticker extraction.

Then the benchmark asks for ticker, action, and conviction together. This is where the machines start looking less like analysts and more like interns asked to read body language during a quarterly earnings call. The best TAC score in the table is DeepSeek-V3 on segmented transcripts at 28.17 F1. GPT-4o on segmented video reaches 27.86. Gemini 2.0 Pro on segmented video reaches 25.21. No model crosses 30.

That pattern matters more than the ranking of any individual model.

Result	Likely purpose in the paper	What it supports	What it does not prove
Multimodal inputs improve ticker extraction	Main benchmark evidence	Visual cues help when the target appears on-screen	Video understanding is generally solved
TA performance drops sharply from T	Main benchmark evidence	Recommendation action requires financial and conversational interpretation	Models cannot be useful at all
TAC remains below 30 F1	Main benchmark evidence	Conviction extraction is still weak	Human conviction labels are perfect or universal
Segmented inputs beat full-length inputs	Main evidence plus implementation lesson	Scoping context reduces noise	Any segmentation method will work
Portfolio backtest links labels to returns	Exploratory financial outcome analysis	Conviction and advice can be studied against market outcomes	Finfluencer trading is a validated strategy
Non-penny-stock analysis	Robustness/sensitivity test	Headline inverse returns are assumption-sensitive	All finfluencer content is worthless

The paper’s strongest AI result is therefore not “multimodality wins.” It is more precise: multimodality helps where the visual channel contains explicit financial identifiers, but it does not yet solve the harder interpretive layer. A model can see the ticker on the chart and still miss whether the speaker is recommending, hedging, joking, recapping, or performing conviction for the algorithmic gods of engagement.

Segmentation is not a convenience; it is the control knob

One of the more operationally useful findings is that segmented inputs outperform full-length inputs. This sounds like a preprocessing detail until you imagine the actual video.

Full-length finfluencer content contains intros, disclaimers, sponsor reads, portfolio updates, macro commentary, multiple tickers, jokes, YouTube rituals, and irrelevant tangents. Feeding all of that into a model and asking for one clean recommendation is not “giving it more context.” It is making the model rummage through a garage.

The paper’s segment-level approach does two things. First, it isolates the relevant recommendation window. Second, it lets the benchmark compare full-video understanding against focused recommendation understanding. That distinction matters for deployment.

A financial platform trying to monitor influencer recommendations should not begin with one giant prompt over a complete video. A more realistic architecture would be:

detect candidate recommendation segments;
transcribe full video and segments;
extract ticker and action from the segment;
use surrounding context only where ambiguity remains;
score conviction separately from recommendation validity;
route uncertain or high-impact cases to review.
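
The steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the paper's implementation: `Candidate`, `Finding`, `analyse`, the 0.7 confidence cut-off, and the toy extractor and conviction scorer are all hypothetical stand-ins for real segment detection and model calls. "High-impact" is read here as conviction 3, which is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    start_s: int        # where the candidate segment begins, in seconds
    transcript: str     # text of the candidate recommendation segment
    context: str = ""   # surrounding transcript, used only as a fallback


@dataclass
class Finding:
    start_s: int
    ticker: Optional[str]
    action: Optional[str]
    conviction: Optional[int]
    needs_review: bool


def analyse(candidates, extract, score_conviction, min_confidence=0.7) -> List[Finding]:
    findings = []
    for cand in candidates:
        # Read the focused segment first.
        ticker, action, conf = extract(cand.transcript)
        # Widen the context only where ambiguity remains.
        if conf < min_confidence and cand.context:
            ticker, action, conf = extract(f"{cand.context} {cand.transcript}")
        # Conviction is scored separately, and only for an actual recommendation.
        conviction = score_conviction(cand) if ticker and action else None
        uncertain = conf < min_confidence
        high_impact = conviction == 3
        findings.append(
            Finding(cand.start_s, ticker, action, conviction, uncertain or high_impact)
        )
    return findings


# Toy stand-ins for real models (keyword matching, purely illustrative).
def toy_extract(text):
    words = text.lower().split()
    ticker = "TSLA" if "tsla" in words else None
    action = "buy" if "buy" in words else None
    return ticker, action, (0.9 if ticker and action else 0.3)


def toy_conviction(cand):
    return 3 if "now" in cand.transcript.lower().split() else 2


candidates = [
    Candidate(374, "I would buy TSLA now"),
    Candidate(500, "this one is interesting", context="next up is TSLA"),
    Candidate(610, "I might buy TSLA on a dip"),
]
findings = analyse(candidates, toy_extract, toy_conviction)
for f in findings:
    print(f)
```

The design choice worth noting is that the wider context is a fallback rather than the default input, and that a finding with no clear action is routed to review instead of being forced into a label.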

This is not glamorous architecture. It is merely useful, which is sometimes better.

Segmentation also changes the business economics. If the model sees less irrelevant material, inference becomes cheaper, review becomes easier, and the system can attach confidence to a specific timestamp rather than to the mood of a whole video. For compliance, auditability, or investor-warning products, timestamped evidence is not a nice extra. It is the difference between “the model thinks this video is risky” and “at 06:14, the speaker recommends buying TSLA with high conviction.”
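
A timestamped finding of that kind could be stored as a small record. The sketch below is hypothetical throughout: the `Evidence` class, its field names, and the video identifier are illustrative choices, and the conviction labels simply name the 1–3 scale described earlier.

```python
from dataclasses import dataclass

CONVICTION_LABELS = {1: "low", 2: "medium", 3: "high"}


@dataclass(frozen=True)
class Evidence:
    video_id: str
    start_s: int     # offset of the recommendation within the video, in seconds
    ticker: str
    action: str
    conviction: int  # 1-3

    def timestamp(self) -> str:
        """Format the offset as MM:SS (minutes may exceed 59 for long videos)."""
        minutes, seconds = divmod(self.start_s, 60)
        return f"{minutes:02d}:{seconds:02d}"

    def summary(self) -> str:
        label = CONVICTION_LABELS[self.conviction]
        return (
            f"at {self.timestamp()}, the speaker recommends a {self.action} "
            f"on {self.ticker} with {label} conviction"
        )


ev = Evidence("hypothetical-video-id", 374, "TSLA", "buy", 3)
print(ev.summary())
```

Keeping the offset as an integer and formatting it only for display means reviewers can jump to the moment in the video while downstream systems still sort and filter on the raw value.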

Conviction is a weak signal, not a trading strategy wearing sunglasses

The paper’s conviction label is deliberately multimodal. A low score reflects hesitant tone, doubtful expressions, frequent qualifiers, or a mismatch between a bold title and a weak delivery. A medium score reflects moderate confidence. A high score reflects assertive language, energetic delivery, fewer qualifiers, and alignment between title and video.

This is useful because conviction is not the same as sentiment. A speaker can be positive but cautious, negative but tentative, or extremely confident while being spectacularly wrong. We have all met that person. Some have podcasts.

The portfolio analysis asks whether conviction maps onto returns. The answer is: somewhat, but not enough to worship it.

From January 1, 2018 to August 1, 2024, the authors simulate strategies starting from $100, with trades held for six months. The core results are:

Strategy / benchmark	Sharpe	PnL ($)	Cumulative return	Annual return
Inverse YouTuber	0.41	195.38	195.38%	17.90%
QQQ	0.68	189.74	189.74%	17.55%
SPY	0.65	102.02	102.02%	11.28%
Buy-and-Hold	0.46	84.29	84.29%	9.74%
Buy-and-Hold weighted by conviction	0.30	33.35	33.35%	4.47%
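
The annual-return column is consistent with compounding the cumulative return over the backtest window. The check below is a sketch that assumes the period is measured in calendar days over a 365.25-day year; the paper's exact day-count convention is not stated here, so small rounding differences are expected.

```python
from datetime import date

START, END = date(2018, 1, 1), date(2024, 8, 1)
# Assumption: calendar days divided by a 365.25-day year.
YEARS = (END - START).days / 365.25


def annualised(cumulative_pct: float, years: float = YEARS) -> float:
    """Convert a cumulative return (%) into a compound annual rate (%)."""
    growth = 1 + cumulative_pct / 100
    return (growth ** (1 / years) - 1) * 100


# (cumulative return %, reported annual return %) copied from the table above.
table = {
    "Inverse YouTuber": (195.38, 17.90),
    "QQQ": (189.74, 17.55),
    "SPY": (102.02, 11.28),
    "Buy-and-Hold": (84.29, 9.74),
    "Buy-and-Hold weighted by conviction": (33.35, 4.47),
}

for name, (cumulative, reported) in table.items():
    print(f"{name}: recomputed {annualised(cumulative):.2f}% vs reported {reported:.2f}%")
```

Under that day-count assumption, each recomputed figure lands within a tenth of a percentage point of the reported one, which is a useful sanity check that the two return columns describe the same backtest.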

The simple reading is that finfluencer buy recommendations underperform passive index funds. The more interesting reading is that the conviction-weighted strategy performs worse than the unweighted buy-and-hold strategy. That should make operators cautious. A high-conviction speaker may be more informative than a low-conviction speaker, but using conviction as a naive allocation weight can still produce poor portfolios.

The paper’s conviction split sharpens this point. High-conviction recommendations produce a total return of 64.15%, medium-conviction recommendations produce 40.41%, and low-conviction recommendations lose 19.65%. So yes, conviction contains information. It helps separate the worst calls from less-bad calls. But high conviction still lags the index benchmark in the paper’s analysis.

That is the uncomfortable middle: confidence is not meaningless, but it is not alpha. It is a noisy behavioural feature. Treating it as a risk flag or triage signal is reasonable. Treating it as an automated buy signal is how one converts machine learning into expensive theatre.

The inverse strategy is a warning light, not a product roadmap

The most headline-friendly result is the inverse finfluencer strategy. It delivers the highest annual return in the main table: 17.90%, slightly above QQQ’s 17.55%. But its Sharpe ratio of 0.41 is lower than QQQ’s 0.68 and SPY’s 0.65, which means each unit of volatility buys less excess return than it does in either index fund. That already changes the interpretation. A strategy can win on raw return and still be less attractive once volatility enters the room, as volatility rudely tends to do.
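
The gap between raw return and risk-adjusted return is easy to reproduce with toy numbers. The sketch below uses the textbook Sharpe ratio (mean excess return over its standard deviation, annualised by the square root of the number of periods per year). The risk-free rate, the monthly frequency, and both return series are illustrative assumptions, not values from the paper.

```python
import math
import statistics


def sharpe(returns, risk_free=0.0, periods_per_year=12):
    """Annualised Sharpe ratio from a series of per-period returns."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)


# Hypothetical monthly returns: one steady series, one that swings hard.
steady = [0.010, 0.012, 0.008, 0.011, 0.009, 0.010]
volatile = [0.08, -0.06, 0.07, -0.05, 0.09, -0.06]

print("mean steady  :", statistics.mean(steady))
print("mean volatile:", statistics.mean(volatile))
print("sharpe steady  :", round(sharpe(steady), 2))
print("sharpe volatile:", round(sharpe(volatile), 2))
```

The volatile series has the higher average return and the far lower Sharpe ratio, which is the same shape as the inverse strategy against the index funds in the main table.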

More importantly, the appendix makes the inverse result look fragile. When the paper excludes penny stocks—defined as stocks trading below $5—the dataset falls from 687 recommendations to 567. In that non-penny-stock subset, the inverse strategy becomes the worst performer, with -146.40% cumulative return and -20.97% annual return. Meanwhile, non-penny buy-and-hold improves to 122.19% cumulative return and 12.90% annual return, beating SPY on raw return but still trailing QQQ.
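
The filter itself is simple, which is part of why the result is so uncomfortable. A minimal sketch follows, assuming the $5 threshold is applied to the price on the recommendation date (the paper's exact reference price is not stated here). The `Pick` class and the placeholder tickers are hypothetical.

```python
from dataclasses import dataclass

PENNY_THRESHOLD_USD = 5.0  # penny stocks are defined as trading below $5


@dataclass(frozen=True)
class Pick:
    ticker: str
    price_usd: float  # assumption: price on the recommendation date


def exclude_penny_stocks(picks, threshold=PENNY_THRESHOLD_USD):
    """Keep only picks priced at or above the threshold."""
    return [p for p in picks if p.price_usd >= threshold]


# Placeholder tickers and prices, purely for illustration.
picks = [Pick("AAA", 182.40), Pick("BBB", 4.99), Pick("CCC", 5.00), Pick("DDD", 0.87)]
kept = exclude_penny_stocks(picks)
print([p.ticker for p in kept])
```

In the paper this one filter takes the dataset from 687 recommendations to 567 and flips the inverse strategy from best to worst, so any backtest built on such data should report results with and without it.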

This sensitivity test is not a side note. It is the difference between “bet against influencers” and “the inverse result may be heavily shaped by volatile low-priced stocks and shorting assumptions.” The first sentence sells newsletters. The second sentence is what adults should say.

The quantile analysis adds another useful boundary. The paper reports that only the top 20% of recommended stocks outperform QQQ; the remaining 80% underperform it. That does not mean every finfluencer recommendation is useless. It means the distribution is unforgiving. Some picks work very well, but most do not beat a simple high-growth index benchmark.
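
One way to run this kind of check is to sort picks by return, split them into five equal buckets, and compare each bucket's average against the benchmark. The sketch below assumes that approach; the paper's exact quantile method may differ, and the per-pick returns are invented for illustration. Only the benchmark figure, QQQ's 189.74% cumulative return, comes from the table above.

```python
def quintile_means(returns):
    """Average return of each fifth of the picks, best fifth first."""
    if len(returns) < 5:
        raise ValueError("need at least five returns to form quintiles")
    ordered = sorted(returns, reverse=True)
    n = len(ordered)
    buckets = [ordered[i * n // 5:(i + 1) * n // 5] for i in range(5)]
    return [sum(b) / len(b) for b in buckets]


# Hypothetical per-pick cumulative returns, expressed as fractions.
pick_returns = [2.5, 1.9, 0.6, 0.4, 0.2, 0.1, -0.1, -0.3, -0.5, -0.8]
benchmark = 1.8974  # QQQ's cumulative return from the main table, as a fraction

beats_benchmark = [mean > benchmark for mean in quintile_means(pick_returns)]
print(beats_benchmark)
```

A distribution like this is why averaging across all picks hides the real picture: a handful of winners carry the top bucket while the rest trail the index.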

So the financial conclusion should be carefully phrased:

* The paper directly shows that broad passive benchmarks outperform finfluencer buy-and-hold strategies in the tested setup.
* It directly shows that high-conviction picks outperform low-conviction picks but still fail to beat QQQ.
* It directly shows that inverse recommendations can win on headline annual return in the main test but carry higher risk and fail badly when penny stocks are excluded.
* Cognaptus infers that conviction is better used for monitoring and risk scoring than for direct allocation.
* What remains uncertain is whether these relationships hold across platforms, markets, time periods, influencer categories, model improvements, and realistic transaction constraints.

That last point is not decorative caution. It changes the product.

The business value is investor-protection intelligence, not auto-trading cosplay

VideoConviction is most valuable as infrastructure for interpreting persuasive financial media. That has several business uses, none of which require pretending the model is Warren Buffett with a webcam.

First, wealth platforms can use this style of system to warn users when they are acting on highly persuasive but weakly grounded content. A platform does not need to say, “This stock will underperform.” It can say, “This recommendation is high-conviction, action-oriented, and concentrated in a speculative individual stock; compare it with your stated risk tolerance before trading.” That is dull. It may also prevent expensive mistakes. Dull wins more often than pitch decks admit.
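
A warning of that kind could start life as a plain rule before anyone trains anything. The sketch below is hypothetical: the `build_warning` function, its inputs, and the trigger conditions are illustrative choices, not something the paper specifies, and a real platform would need its own suitability and compliance rules.

```python
from typing import Optional

ACTIONABLE = {"buy", "sell"}  # assumed set of action-oriented labels


def build_warning(
    conviction: int,          # 1-3, as in the benchmark's conviction score
    action: str,
    is_single_stock: bool,
    user_risk_tolerance: str,  # hypothetical profile value: "low", "medium", "high"
) -> Optional[str]:
    """Return a context message for persuasive single-stock calls, else None."""
    if conviction < 3 or action not in ACTIONABLE or not is_single_stock:
        return None
    if user_risk_tolerance == "high":
        return None
    return (
        "This recommendation is high-conviction, action-oriented, and "
        "concentrated in an individual stock; compare it with your stated "
        "risk tolerance before trading."
    )


flagged = build_warning(3, "buy", True, "low")
ignored = build_warning(2, "buy", True, "low")
print(flagged)
```

Note that the rule says nothing about whether the stock will underperform. It only reacts to how the recommendation was delivered, which is the part the benchmark actually measures.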

Second, brokerages and consumer-finance apps can build social-content risk layers. If a user arrives from a video link or searches a ticker after a viral clip, the platform can surface context: whether the clip contained an explicit buy/sell recommendation, how strong the recommendation was, and whether similar recommendations historically outperformed benchmarks. This is not personalised advice. It is friction against impulsive action.

Third, compliance teams can monitor influencer content at scale. The relevant question is not only whether a ticker appears, but whether the speaker creates a call to action. That is exactly where TA and TAC become practical. The model’s weakness is also the business requirement: action and conviction need tighter extraction, better review workflows, and explainable timestamps.

Fourth, asset managers and market-intelligence teams can use such data as an alternative signal—not to trade blindly, but to map attention, persuasion, and retail narratives. A spike in high-conviction recommendations around a low-liquidity stock is not automatically a buy or sell signal. It is a market microstructure event worth watching.

The ROI logic is therefore not “generate alpha from YouTube.” It is:

Use case	What VideoConviction-like systems can provide	What humans still need to decide
Investor protection	Detect persuasive stock recommendations and flag risky calls	Whether and how to intervene
Compliance monitoring	Timestamped ticker/action/conviction evidence	Legal interpretation and escalation
Wealth-platform UX	Contextual warnings before impulsive trades	Suitability and user-specific advice rules
Social-finance analytics	Maps of attention, conviction, and recommendation clusters	Whether signal quality justifies trading use
Model benchmarking	Stress tests for multimodal financial understanding	Deployment thresholds and review policies

The pattern is clear: the system is valuable as a diagnostic layer before action, not as an action engine by itself.

The paper’s hardest lesson is that “what was said” is cheaper than “what was meant”

There is a reason TAC is difficult. Conviction lives in the gap between language and performance.

A transcript can capture “I think this is a buy.” It may not capture whether the speaker pauses, hedges, smirks, overacts, reads from a slide, contradicts the title, or sounds like they are filling time before the sponsor coupon code. Video and audio should help, but the paper shows that current multimodal systems still do not reliably convert those signals into human-aligned conviction labels.

This creates a useful correction to a common belief about multimodal AI. The belief is that adding video makes the model more human-like. The correction is that adding video gives the model more channels, not necessarily better judgement. The replacement belief should be: multimodality helps when the added channel contains explicit evidence, but it still needs task-specific supervision, segmentation, evaluation, and review when the target is social meaning.

In finance, social meaning is expensive. It is not just sentiment. It includes intent, persuasion, uncertainty, incentives, and audience framing. A model that extracts all tickers perfectly but misunderstands whether a statement is actionable can still be dangerous. In fact, it may be more dangerous, because it looks competent on the part of the task that is easiest to measure.

That is why VideoConviction is a useful benchmark. It does not reward models for merely watching. It asks whether watching changes interpretation.

Boundaries that matter before anyone builds the dashboard

The paper is strongest when read within its boundaries.

The dataset focuses on U.S. stock recommendations from YouTube finfluencers, using channels selected from a curated list and videos from 2018 to 2024. That is already a specific media environment. YouTube finance is not TikTok, Reddit, Discord, WeChat, or private trading groups. The speaking style, content length, disclosure norms, and audience behaviour differ across platforms.

The model evaluation is zero-shot. That makes the benchmark useful for measuring general capability, but it does not tell us the ceiling after domain-specific fine-tuning, better video encoders, stronger segmentation, or task-specific calibration. The low TAC results are therefore not a permanent law of nature. They are a measurement of current systems under the tested setup.

The portfolio analysis uses historical backtesting. It is useful because it ties annotations to outcomes, but it is not a deployable strategy specification. Trading costs, liquidity, shorting feasibility, borrow costs, timing assumptions, survivorship issues, platform effects, and real-time availability can all change results. The non-penny-stock sensitivity test already shows how quickly one appealing result can change under a reasonable filter.

The conviction label is expert-annotated, but conviction is not a universal psychological constant. Cultural presentation norms, camera style, editing, and influencer persona can affect perceived confidence. A model trained to detect U.S. YouTube-style conviction may not transfer cleanly to another language, market, or platform.

These boundaries do not weaken the paper. They make it usable. Good benchmarks tell us where systems fail under controlled conditions. Bad product teams then remove the controls and call the failure “innovation.” Let us not.

The operator’s takeaway: build sceptical machines first

VideoConviction lands in a useful place between AI benchmarking and market behaviour. It shows that persuasive financial content can be structured into extractable signals, but also that the most important signals are not yet easy for models to read. Tickers are comparatively cheap. Actions are harder. Conviction is harder still. Returns are harder than all of them, because the market has a charming habit of humiliating clean narratives.

For operators, the right response is not to ignore finfluencer media. Retail attention matters. Persuasion matters. High-conviction financial claims can move behaviour even when they do not improve outcomes. But the correct first system is not an automated trader. It is a sceptical interpreter.

A useful system would watch the video, isolate the recommendation, extract the ticker and action, estimate conviction, attach timestamps, compare the implied trade against user risk and benchmark context, and escalate uncertainty. That is less glamorous than “AI that trades YouTube.” It is also less likely to set customer money on fire.

The real lesson is therefore simple: confidence is part of the signal, but not the signal you should blindly buy. Machines can help measure the sound and fury. They are not yet very good at deciding what it signifies.

Cognaptus: Automate the Present, Incubate the Future.

¹ Michael Galarnyk, Veer Kejriwal, Agam Shah, Yash Bhardwaj, Nicholas Watney Meyer, Anand Krishnan, and Sudheer Chava, “VideoConviction: A Multimodal Benchmark for Human Conviction and Stock Market Recommendations,” arXiv:2507.08104, 2025. https://arxiv.org/abs/2507.08104
