TL;DR for operators
The paper behind THEME is not really about asking an LLM to “find AI stocks” and hoping it returns a genius portfolio, because that would be the usual theatre with a Bloomberg terminal costume.[1] It is about building a retrieval layer that understands investment themes as a special kind of search problem: cross-sector, text-heavy, time-sensitive, and annoyingly allergic to static classification.
THEME does three useful things. First, it trains embeddings so that theme descriptions and constituent stock profiles live closer together in vector space. Second, it adds recent return behaviour through a lightweight temporal adapter, so the model does not merely retrieve companies that sound relevant; it tries to rank companies that are both thematically aligned and currently investable. Third, it builds a broader Thematic Representation Set, or TRS, so supervision is not trapped inside the fashionable corners of the thematic ETF market.
The paper’s strongest evidence is retrieval performance. Across several embedding backbones, THEME materially improves Hit Rate and Precision at top-k cutoffs. For Linq-Embed-Mistral, HR@3 rises from 0.5155 to 0.8196 and P@3 from 0.3522 to 0.6289. That matters because thematic investing starts with universe construction. If the candidate pool is wrong, the optimiser is just arranging furniture in the wrong house.
The portfolio results are promising but should be read with adult supervision. The authors test equal-weighted top-k portfolios from April 23, 2024 to April 29, 2025, chaining 14-trading-day forward returns. For gte-Qwen2-7B-instruct, THEME lifts the top-3 Sharpe ratio (SR@3) from 0.5014 to 0.7592 and cumulative return at top-3 from 0.0917 to 0.1645. But the GritLM-7B baseline is already strong, and THEME does not improve every metric at every cutoff. The useful conclusion is not “embeddings beat markets.” It is: task-specific semantic-temporal retrieval can produce better thematic shortlists under the paper’s setup.
For business use, the cleanest deployment is not autonomous trading. It is a thematic intelligence layer: generate candidate universes, refresh theme exposure, support analyst screening, feed index construction, or personalise retail thematic baskets. Then add liquidity filters, risk controls, transaction-cost modelling, compliance screens, and the usual boring machinery that keeps a research demo from walking into production wearing sunglasses.
Themes are moving search problems, not sectors with better branding
A portfolio manager hears: “We want exposure to AI software and chipmakers.”
That sentence looks simple. It is not. Does it include cloud infrastructure? Semiconductor equipment? Edge devices? Data-centre power systems? Cybersecurity platforms selling into AI workloads? Companies whose latest investor deck says “AI” 27 times because apparently punctuation was too subtle?
Traditional sector labels do not solve this. ETF holdings help, but they are snapshots of someone else’s methodology, rebalance calendar, coverage bias, and commercial product design. Keyword screens are fast, but dumb. Generic embeddings are smarter, but they are usually trained to understand text similarity in general, not to decide whether a company belongs inside a theme-driven investment universe.
THEME starts from this practical annoyance. The paper treats thematic investing as retrieval first and portfolio construction second. That order is important. In thematic investing, the first operational problem is not weighting. It is identification: which companies should even be candidates for the theme?
The paper’s mechanism-first logic is therefore stronger than a scoreboard-first summary. The experiments matter because they test whether each layer of the mechanism earns its keep:
| Paper component | Practical problem it addresses | Evidence role |
|---|---|---|
| Theme-stock semantic alignment | Generic embeddings do not naturally cluster stocks by investment theme | Main retrieval evidence |
| Temporal refinement with recent returns | Pure theme fit may retrieve conceptually relevant but poorly timed names | Portfolio utility test |
| TRS dataset expansion | ETF-only supervision is narrow and biased toward popular themes | Ablation on training data |
| Theme descriptions as anchors | Stock-to-stock similarity may learn local co-holding patterns rather than the theme itself | Ablation on anchor strategy |
This is the paper’s actual contribution: not a talking chatbot for investors, but a representation-learning pipeline for turning vague thematic language into ranked stock candidates.
THEME’s first move is to teach embeddings what “on-theme” means
The first stage is semantic alignment. THEME uses a pre-trained text embedding model as the backbone and adapts it with LoRA, keeping the base model largely frozen while training a smaller set of parameters. The training signal is the relationship between a theme description and the stocks known to belong to that theme.
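To make the "largely frozen" part concrete, here is a minimal sketch of the low-rank update idea behind LoRA in plain Python. The matrix sizes, rank, and scaling factor are illustrative assumptions, not the paper's settings. The frozen weight `W` is never modified; only the small factors `A` and `B` would be trained.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), leaving the frozen W untouched."""
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical 2x3 frozen weight with a rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # r x d_in, trainable
B = [[0.0], [0.5]]      # d_out x r, trainable (values after some hypothetical training)
W_eff = lora_effective_weight(W, A, B, alpha=2.0, rank=1)
print(W_eff)
```

At toy sizes the saving is invisible, but for a real backbone the two factors hold far fewer numbers than the full weight matrix, which is why the approach is cheap enough to repeat across several embedding models.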
The contrastive setup is intuitive. A theme description is the anchor. A constituent stock is the positive example. Non-constituent stocks are negatives. The model learns to pull theme and relevant stock embeddings closer together while pushing irrelevant stocks away.
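The anchor, positive, and negative roles can be sketched as a loss function. The source does not spell out the exact objective, so this assumes an InfoNCE-style loss over cosine similarities with a hypothetical temperature of 0.05. The vectors are toy values, not real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce_loss(theme, positive, negatives, temperature=0.05):
    """Cross-entropy of picking the positive stock out of positive + negatives."""
    logits = [cosine(theme, positive) / temperature]
    logits += [cosine(theme, n) / temperature for n in negatives]
    top = max(logits)  # subtract the max before exponentiating, for stability
    log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_denominator - logits[0]


theme_vec = [1.0, 0.0]                          # anchor: theme description
constituent = [0.9, 0.1]                        # positive: stock in the theme
non_constituents = [[0.0, 1.0], [-1.0, 0.2]]    # negatives: stocks outside it

aligned = info_nce_loss(theme_vec, constituent, non_constituents)
misaligned = info_nce_loss(
    theme_vec, non_constituents[0], [constituent, non_constituents[1]]
)
print(aligned, misaligned)
```

The loss is near zero when the constituent already sits closest to the theme, and large when a non-constituent is treated as the positive. That gradient is what pulls theme and constituent together during training.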
That matters because financial themes are not ordinary text categories. “Future vehicles” may include automakers, battery firms, semiconductor suppliers, charging infrastructure, fleet software, and materials companies. A general embedding model can recognise that those words live near each other. It does not automatically know how those relationships should translate into an investment universe.
The paper’s Figure 1 makes the point visually through a t-SNE comparison: the baseline embedding space fails to group stocks cleanly by investment theme, while THEME-tuned embeddings form more distinct thematic clusters. The figure is not the main proof; t-SNE plots are charming but not legally binding. Its purpose is interpretability. It shows why the retrieval tables later have a plausible mechanism behind them.
The important design choice is that THEME uses theme text as the anchor rather than only aligning stocks with other stocks. That sounds minor. It is not. Stock-stock alignment can learn that two companies often appear in the same ETF, but that does not necessarily teach the model the abstract concept that makes them belong together. Theme anchoring forces the representation to organise around the narrative itself.
In practical terms, this is the difference between “these stocks are often neighbours” and “these stocks answer the same investment question.”
The second move is to add market timing without pretending semantics are alpha
After semantic alignment, THEME adds a temporal refinement stage. This uses a lightweight two-layer adapter that combines a stock’s semantic embedding with its recent daily return history. The adapter is trained with a triplet loss: for a given theme, two constituent stocks are compared, and the one with higher forward return over the chosen horizon is treated as the positive example.
Mechanically, the theme remains the anchor. The adapter learns to place the higher-forward-return stock closer to the theme anchor than the lower-forward-return stock, by at least a margin.
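The ordering rule can be written as a hinge loss. This is a sketch of the loss only, not the adapter itself, and it assumes cosine distance and a hypothetical margin of 0.1; the source confirms neither. The embeddings passed in stand for adapter outputs.

```python
import math


def cosine_distance(u, v):
    """1 minus cosine similarity, so 0.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def triplet_loss(theme, stock_a, stock_b, fwd_ret_a, fwd_ret_b, margin=0.1):
    """Penalise rankings where the higher-forward-return stock is not closer
    to the theme anchor than the lower-forward-return stock by `margin`."""
    if fwd_ret_a >= fwd_ret_b:
        positive, negative = stock_a, stock_b
    else:
        positive, negative = stock_b, stock_a
    gap = cosine_distance(theme, positive) - cosine_distance(theme, negative)
    return max(0.0, gap + margin)


theme = [1.0, 0.0]
near = [1.0, 0.0]   # toy embedding already close to the theme
far = [0.0, 1.0]    # toy embedding far from the theme

# The near stock had the better forward return: ordering is satisfied.
loss_ok = triplet_loss(theme, near, far, fwd_ret_a=0.04, fwd_ret_b=-0.01)
# The far stock had the better forward return: ordering is violated.
loss_bad = triplet_loss(theme, near, far, fwd_ret_a=-0.01, fwd_ret_b=0.04)
print(loss_ok, loss_bad)
```

Both stocks are constituents of the same theme, so the loss only reorders names that are already on-theme. That is the sense in which timing is layered on top of relevance rather than replacing it.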
This is a careful move. The paper does not replace theme relevance with momentum. It fuses recent return dynamics into the representation after the semantic space has already been shaped. That sequence matters. If the temporal layer dominated too much, the system would become a momentum screener with decorative narrative labels. If semantics dominated too much, the system would retrieve beautifully thematic names that may have stopped working six months ago.
THEME is trying to sit between those failures. The semantic layer answers: “What belongs to the theme?” The temporal layer answers: “Among the relevant names, what looks more suitable now?”
This does not prove stock-picking genius. It proves that the representation can be trained to encode relative short-horizon return information within a theme. In a production setting, that is best treated as a ranking prior, not an execution mandate.
TRS is the quiet data contribution
The dataset work is less flashy than the model, which means it is probably more important.
THEME begins with 1,153 real-world thematic ETFs, covering about 3,000 unique U.S. equities. The authors filter for ETFs with at least 10 constituents, leaving 969 ETFs split into 678 training, 97 validation, and 194 test ETFs. The training data is then augmented through the Thematic Representation Set, expanding the thematic universe to 196 themes using sector and industry taxonomies plus news-derived information.
This matters because ETF supervision is both useful and biased. It is useful because ETF holdings are real market artefacts: someone has already defined a theme and mapped it to constituents. But ETF coverage is not a neutral map of the world. It tends to overrepresent commercially attractive themes: clean energy, technology, AI, cybersecurity, and whatever else has recently survived the product committee.
TRS is the paper’s attempt to break out of that ETF-only bottleneck. It lets stocks belong to multiple themes, draws from richer textual profiles built from SEC filings and financial news, and broadens the learning signal beyond existing products.
For operators, the lesson is blunt: the model architecture is not enough. A thematic retrieval system is only as good as the theme-stock supervision it sees. If the training data reflects yesterday’s ETF shelves, the model will be very good at rediscovering yesterday’s ETF shelves. Congratulations, you built a mirror with cloud hosting.
Retrieval is the main evidence, and the gains are large
The retrieval experiment is the paper’s core benchmark. Each model ranks stocks for a given theme description, and the authors evaluate Hit Rate (HR@k) and Precision (P@k) on the top-k results. Hit Rate measures whether at least one relevant stock appears in the retrieved set. Precision measures what fraction of the retrieved set is relevant.
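Both metrics are short enough to write out. The sketch below follows the definitions above and assumes scores are averaged per theme, which is the usual convention but is not confirmed by the source. Tickers are hypothetical.

```python
def hit_rate_at_k(ranked, relevant, k):
    """Return 1.0 if any of the top-k tickers is a true constituent, else 0.0."""
    return 1.0 if any(t in relevant for t in ranked[:k]) else 0.0


def precision_at_k(ranked, relevant, k):
    """Return the share of the top-k tickers that are true constituents."""
    return sum(t in relevant for t in ranked[:k]) / k


def mean_over_themes(metric, queries, k):
    """Average a per-theme metric over (ranked_tickers, relevant_set) pairs."""
    total = sum(metric(ranked, relevant, k) for ranked, relevant in queries)
    return total / len(queries)


# Theme 1 retrieves two constituents in its top 3; theme 2 retrieves none.
queries = [
    (["AAA", "BBB", "CCC"], {"AAA", "CCC", "XXX"}),
    (["DDD", "EEE", "FFF"], {"ZZZ"}),
]
hr3 = mean_over_themes(hit_rate_at_k, queries, 3)
p3 = mean_over_themes(precision_at_k, queries, 3)
print(hr3, p3)
```

Hit Rate is the forgiving metric: one correct name in the top k is enough. Precision is the one that matters for universe construction, because every off-theme name in the shortlist is a slot wasted.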
The paper compares THEME-enhanced versions of several embedding backbones against their vanilla versions, and also reports strong LLM baselines such as GPT-4.1 and Gemini-2.5. The results are not subtle.
| Model | Vanilla HR@3 | THEME HR@3 | Vanilla P@3 | THEME P@3 | Interpretation |
|---|---|---|---|---|---|
| bge-small-en-v1.5 | 0.1392 | 0.6031 | 0.0584 | 0.3557 | A small model becomes competitive after task-specific tuning |
| SFR-Embedding-Mistral | 0.3711 | 0.7887 | 0.2062 | 0.6082 | Strong retrieval lift at 7B scale |
| GritLM-7B | 0.0825 | 0.8196 | 0.0344 | 0.6014 | The vanilla model performs poorly here; THEME changes the task fit dramatically |
| gte-Qwen2-7B-instruct | 0.5206 | 0.7938 | 0.3299 | 0.5790 | A strong vanilla retriever still improves |
| Linq-Embed-Mistral | 0.5155 | 0.8196 | 0.3522 | 0.6289 | Highest P@3 and joint-highest HR@3 in this table |
The LLM baselines are worth reading carefully. GPT-4.1 records HR@3 of 0.7113 and P@3 of 0.5189. Gemini-2.5 records HR@3 of 0.6494 and P@3 of 0.4020. Four of the five THEME-enhanced models in the table above beat both baselines on both metrics; only bge-small-en-v1.5 falls short.
That is the point most business readers will be tempted to misread. The result does not mean “LLMs are bad at finance.” It means that for this retrieval task, a specialised embedding system trained on theme-stock relationships can outperform general-purpose LLM retrieval behaviour. The model does not need to know everything. It needs to know the task geometry.
The smaller-model result is also operationally interesting. bge-small-en-v1.5 has only 33M parameters, yet THEME lifts it to HR@3 of 0.6031 and P@3 of 0.3557. That does not make it the best model in the table, but it suggests that specialised supervision can matter more than simply buying a larger general model and asking it nicely.
Portfolio construction tests investment utility, not production readiness
The second experiment asks whether better retrieval leads to better portfolios. The authors construct equal-weighted portfolios from the top-k retrieved stocks and test them over a rolling period from April 23, 2024 to April 29, 2025. For each window, the portfolio holds the top-k names ranked by similarity to the theme anchor, records average daily return over the next 14 trading days, then chains those returns into a continuous series.
The metrics are Sharpe ratio (SR), cumulative return (CR), and maximum drawdown (MDD). The paper also reports a real thematic ETF benchmark with a Sharpe ratio of 0.4845, cumulative return of 0.0672, and maximum drawdown of -0.2368.
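For readers who want to reproduce the shape of this test, here is a minimal sketch of the three metrics on a chained return series. It rests on several assumptions the source does not confirm: "chaining" means concatenating each window's daily equal-weighted returns, the risk-free rate is zero, and Sharpe is annualised over 252 trading days. The returns are invented.

```python
import math
import statistics


def equal_weight_daily(stock_returns):
    """Average per-stock daily returns into one equal-weighted portfolio series."""
    return [sum(day) / len(day) for day in zip(*stock_returns)]


def chain(windows):
    """Concatenate each window's daily portfolio returns into one series."""
    return [r for window in windows for r in window]


def cumulative_return(daily):
    """Compound daily returns and report total growth minus one."""
    growth = 1.0
    for r in daily:
        growth *= 1.0 + r
    return growth - 1.0


def sharpe_ratio(daily, periods_per_year=252):
    """Annualised Sharpe ratio with a zero risk-free rate (an assumption)."""
    mean, stdev = statistics.mean(daily), statistics.stdev(daily)
    return mean / stdev * math.sqrt(periods_per_year)


def max_drawdown(daily):
    """Worst peak-to-trough decline of the compounded equity curve (<= 0)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst


# Two hypothetical rebalance windows, each holding two stocks for two days.
window_1 = equal_weight_daily([[0.02, -0.01], [0.00, 0.03]])
window_2 = equal_weight_daily([[-0.04, 0.02], [-0.02, 0.02]])
series = chain([window_1, window_2])
print(cumulative_return(series), sharpe_ratio(series), max_drawdown(series))
```

None of these functions knows about costs or turnover. Each rebalance in a real portfolio would subtract something from the chained series before any of the metrics are computed.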
Here the evidence is promising, but less uniform than the retrieval table.
| Backbone | Variant | SR@3 | MDD@3 | CR@3 | What it means |
|---|---|---|---|---|---|
| Linq-Embed-Mistral | Vanilla | 0.4870 | -0.2551 | 0.0907 | Reasonable baseline |
| Linq-Embed-Mistral | THEME | 0.5881 | -0.2526 | 0.1187 | Moderate improvement |
| gte-Qwen2-7B-instruct | Vanilla | 0.5014 | -0.2427 | 0.0917 | Similar to ETF benchmark on Sharpe |
| gte-Qwen2-7B-instruct | THEME | 0.7592 | -0.2378 | 0.1645 | Strongest practical lift in the reported examples |
| GritLM-7B | Vanilla | 0.5744 | -0.2431 | 0.1154 | Already strong |
| GritLM-7B | THEME | 0.5952 | -0.2683 | 0.1196 | Slight SR/CR gain at top-3, worse drawdown |
The gte-Qwen2-7B-instruct case is the clearest investment-utility result. THEME improves SR@3 from 0.5014 to 0.7592 and CR@3 from 0.0917 to 0.1645, while maximum drawdown improves slightly from -0.2427 to -0.2378. At top-5, the same backbone improves SR from 0.4293 to 0.7711 and CR from 0.0749 to 0.1650.
Linq-Embed-Mistral also improves across the top-3 and top-5 metrics shown in the paper. But GritLM is mixed. THEME gives it a small top-3 Sharpe and cumulative return lift, yet maximum drawdown worsens at top-3, and SR/CR fall at top-5 and top-10 compared with the vanilla GritLM baseline.
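That mixed result is not an embarrassment. It is useful. It suggests THEME adds most when the vanilla portfolio baseline is weak. It is also a reminder that retrieval accuracy and portfolio performance are different tests: vanilla GritLM-7B posts an HR@3 of just 0.0825 in the retrieval table yet a top-3 Sharpe of 0.5744 here, so a strong starting portfolio does not require an on-theme shortlist. When the starting ranking already performs well, temporal refinement may add less and can sometimes perturb it. Models, like consultants, do not improve every meeting by being invited.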
The practical conclusion is therefore narrower than the abstract would like: THEME improves portfolio metrics for several backbones under a specific rolling, equal-weighted, short-horizon setup. It does not prove a complete tradable strategy after transaction costs, turnover constraints, taxes, liquidity limits, benchmark risk, and compliance overlays.
The ablations explain why the mechanism works
The third experiment is an ablation study. Its purpose is not to introduce a second thesis. It tests whether two key design choices actually matter: using theme text as the anchor, and training with TRS rather than ETF-only data.
The first ablation compares theme-based anchoring against stock-stock alignment. In stock-stock alignment, both anchor and positive examples are constituent stocks from the same ETF. That can teach local similarity. Theme-based anchoring instead uses the textual theme description as the anchor and a constituent stock as the positive.
The paper reports positive precision improvements for the theme-anchor setup across every listed backbone and cutoff. For gte-Qwen2-1.5B-instruct, P@3 improves by 0.1908; the text notes this corresponds to an increase from 0.2216 to 0.4124. For GritLM-7B, the P@3 improvement is 0.4261, unusually large because the stock-stock alternative appears poorly suited to that backbone in this task.
The implication is simple: train the model on the abstraction you want it to serve. If users will ask for “AI software and chipmakers,” “water scarcity infrastructure,” or “edge-device cybersecurity,” the model should be organised around theme language, not only around co-held stocks.
The second ablation compares ETF-only training against TRS. Again, TRS improves precision across the reported table. The gains are not all huge, but they are consistent. bge-small-en-v1.5 gains 0.1186 in P@5. multilingual-e5-large-instruct gains 0.0705 in P@3. gte-Qwen2-7B-instruct gains 0.0670 in P@3.
This supports the data argument: ETF-only supervision is grounded but incomplete. TRS broadens coverage and gives the model more balanced theme exposure. For a commercial system, this is the difference between “we cover themes that already have ETFs” and “we can support themes users actually type.”
What the paper directly shows, and what Cognaptus infers
A clean business reading needs separation between evidence and inference. Otherwise every research paper becomes a vendor deck with equations.
| Layer | What the paper shows | What Cognaptus infers for business use | What remains uncertain |
|---|---|---|---|
| Retrieval | THEME improves HR@k and P@k across many embedding backbones and beats reported GPT-4.1/Gemini retrieval baselines on key metrics | A domain-tuned retrieval layer can improve thematic universe construction | Whether labels match a given firm’s internal definition of theme relevance |
| Portfolio utility | Equal-weighted top-k portfolios improve metrics for several backbones over the 2024–2025 test period | Better candidate ranking can feed active thematic rebalancing | Net performance after costs, turnover, taxes, liquidity limits, and benchmark constraints |
| Dataset design | TRS beats ETF-only training in the ablation | Firms should invest in theme taxonomies and labelled supervision, not just model choice | How TRS quality scales across regions, sectors, and less-covered markets |
| Anchor strategy | Theme descriptions as anchors beat stock-stock alignment | User-facing query language should shape the embedding geometry | How well this generalises to entirely novel or highly ambiguous themes |
| System design | The paper sketches cloud-native APIs and modular integration | THEME is best understood as infrastructure for screeners, PM tools, index design, and personalised portfolios | Production governance, auditability, and drift monitoring are not fully tested |
The strongest business use case is candidate generation. THEME can sit upstream of a portfolio optimiser, analyst workbench, or thematic index builder. A user enters a theme. The system retrieves a top-k set of companies. The investment platform then applies the less glamorous but necessary filters: liquidity, market cap, volatility, regional constraints, factor exposure, concentration, sanctions, ESG policy, and suitability.
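The retrieve-then-filter flow can be sketched in a few lines. Everything here is hypothetical: the `Candidate` fields, the thresholds, and the tickers are placeholders for whatever a platform's own data model and policy define.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    ticker: str
    similarity: float            # theme-stock score from the retrieval layer
    market_cap_usd: float
    avg_daily_volume_usd: float
    restricted: bool = False     # e.g. sanctions or compliance list


def screen(candidates, k, min_cap_usd, min_adv_usd):
    """Drop ineligible names first, then keep the k highest-similarity survivors."""
    eligible = [
        c for c in candidates
        if not c.restricted
        and c.market_cap_usd >= min_cap_usd
        and c.avg_daily_volume_usd >= min_adv_usd
    ]
    return sorted(eligible, key=lambda c: c.similarity, reverse=True)[:k]


# Hypothetical retrieval output for one theme query.
retrieved = [
    Candidate("AAA", 0.91, 5e9, 4e7),
    Candidate("BBB", 0.88, 2e8, 1e6),                   # too small and illiquid
    Candidate("CCC", 0.85, 8e9, 6e7, restricted=True),  # blocked by compliance
    Candidate("DDD", 0.80, 3e9, 2e7),
]
shortlist = screen(retrieved, k=2, min_cap_usd=1e9, min_adv_usd=5e6)
print([c.ticker for c in shortlist])
```

Filtering before truncating to k is a deliberate choice in this sketch: names that fail a hard constraint should not consume a slot in the shortlist.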
For retail platforms, the same mechanism could power personalised thematic baskets. The product risk is obvious: users may treat a theme query as investment advice. The safer product architecture is to present THEME output as a research universe, not as “buy these 10 stocks because the vector said so.” The vector, despite its confidence, does not have fiduciary liability. Convenient creature.
For asset managers, the value is workflow compression. Analysts already translate narratives into stock lists. THEME automates part of that mapping, makes it refreshable, and gives a measurable retrieval benchmark. It does not replace judgement; it reduces the manual search surface.
For thematic index builders, the value is methodology. A theme-aware embedding space can provide a repeatable way to propose constituents, monitor drift, and justify inclusion logic. That still needs a rules framework, because “cosine similarity” is not a prospectus.
The production version needs controls the paper does not test
The paper is useful because it is specific. Its limitations should be just as specific.
First, the tested universe is U.S. equities, with supervision drawn from thematic ETFs and augmented TRS data. That does not automatically transfer to emerging markets, ADR-heavy universes, private companies, bonds, crypto assets, or multi-asset thematic baskets. The language and disclosure quality differ. So does liquidity. So does the amount of nonsense in the data, and finance is already an ambitious baseline.
Second, the portfolio test covers April 23, 2024 to April 29, 2025. That is a meaningful rolling evaluation window, but still one market period. It does not establish regime robustness across rate cycles, crisis drawdowns, inflation shocks, sector rotations, or theme bubbles.
Third, the portfolios are equal-weighted top-k baskets based on similarity ranking. That is a clean test of retrieval utility, not a production portfolio construction process. Real portfolios need turnover controls, transaction costs, borrow constraints if shorting is involved, liquidity caps, tracking-error limits, tax awareness, and position-level risk budgeting.
Fourth, the paper does not present a full cost-adjusted trading simulation. The temporal adapter uses recent returns and forward-return labels during training, then the backtest uses rolling forward returns. That is acceptable as an experimental design, but a production implementation must be obsessive about time splits, data availability timestamps, rebalance timing, and survivorship bias. In finance, “almost time-clean” is just another way to say “expensive later.”
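One way to keep the time splits honest is to generate feature and label windows from a single function and assert that they never overlap. The sketch below is an assumption about how one might do that, not the paper's procedure. The 14-day horizon comes from the paper's setup; the 20-day lookback is a hypothetical value.

```python
def walk_forward_windows(n_days, lookback, horizon):
    """Return (feature_range, label_range) pairs as half-open day-index ranges.

    Features use days [start - lookback, start); labels use [start, start + horizon).
    Consecutive label windows do not overlap because `start` advances by `horizon`.
    """
    windows = []
    start = lookback
    while start + horizon <= n_days:
        features = (start - lookback, start)
        labels = (start, start + horizon)
        windows.append((features, labels))
        start += horizon
    return windows


windows = walk_forward_windows(n_days=50, lookback=20, horizon=14)
for features, labels in windows:
    # Features must end no later than the first label day.
    assert features[1] <= labels[0]
print(windows)
```

Index arithmetic is the easy part. The same discipline has to extend to when filings, news, and prices actually became available, which a day index alone does not capture.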
Fifth, ETF-derived supervision is both strength and weakness. ETF holdings reflect real investment products, but also product-market incentives, provider methodology, and popularity bias. TRS reduces this limitation; it does not abolish it. A firm deploying this internally would need its own theme ontology, human review loops, and drift monitoring.
Finally, the paper’s discussion points toward future extensions such as ESG events, earnings calls, patent filings, real-time inference, and event-driven rebalancing. Those are plausible extensions, not demonstrated results. Useful, yes. Proven, no.
The real value is disciplined thematic retrieval
The wrong headline is: “AI learns to invest in themes.”
The better headline is less glamorous and more valuable: “Domain-tuned embeddings can make thematic stock retrieval measurable, refreshable, and partly return-aware.”
That distinction matters. Thematic investing has always had a translation problem. Investors think in narratives. Markets trade companies. The bridge between the two is usually a mess of ETF holdings, sector screens, analyst memory, and marketing copy. THEME proposes a cleaner bridge: train the representation space so that theme descriptions, stock profiles, and recent return dynamics interact in a structured way.
The retrieval evidence is strong. The portfolio evidence is promising but bounded. The ablations are especially useful because they identify which design choices matter: use theme descriptions as anchors, and do not rely only on ETF holdings if you want broader coverage.
For operators, the next step is not to hand THEME the keys to the portfolio. The next step is to treat it as a thematic search layer: generate the candidate universe, explain why names are in scope, refresh rankings on a controlled cadence, and pass the output into risk-aware portfolio construction.
That is less dramatic than “LLM stock picker.” It is also much closer to something a serious investment organisation could actually use without immediately summoning compliance.
Cognaptus: Automate the Present, Incubate the Future.
[1] Hoyoung Lee, Wonbin Ahn, Suhwan Park, Jaehoon Lee, Minjae Kim, Sungdong Yoo, Taeyoon Lim, Woohyung Lim, and Yongjae Lee, “THEME: Enhancing Thematic Investing with Semantic Stock Representations and Temporal Dynamics,” arXiv:2508.16936, 2025.