TL;DR for operators
QuantBench is not another paper asking investors to believe that the newest neural architecture will finally decode markets because it has more layers and a nicer diagram. Mercifully. It is a benchmark platform for quantitative investment that tries to evaluate AI methods across the full quant workflow: factor mining, modelling, end-to-end position generation, portfolio optimisation, and order execution.[1]
The practical finding is sharper than the benchmark packaging. Deep neural networks often fit prediction targets better, especially on Information Coefficient (IC), but that does not reliably convert into better returns or Sharpe ratios. In one direct comparison, LSTM beats XGBoost on IC for both Alpha101 and Alpha158 feature sets, yet XGBoost slightly wins on return and Sharpe under Alpha101. The model with the prettier predictive score is not automatically the model that makes the portfolio behave.
The same pattern repeats elsewhere. Transformers underperform badly in the reported stock-prediction benchmark. Some graph models fail when the graph structure is not a good fit for the task. Adaptive graph models do better, suggesting that relationships between stocks matter, but only when the model learns useful relationships rather than worshipping a static data structure. Adding more information helps only when the integration method is sensible. Fundamentals and news improve performance; graph-encoded industry data lowers both IC and return for the DNN, and adding Wikidata leaves return well below the fundamentals-plus-news result despite a respectable IC.
For investment firms and fintech builders, the business lesson is not “use trees” or “avoid deep learning”. That would be too easy, and therefore suspicious. The lesson is: benchmark the entire decision chain. Evaluate losses against portfolio outcomes. Monitor alpha decay. Treat validation design as a live source of model risk. Use ensembling to reduce variance, but do not mistake averaging for robustness engineering. QuantBench is valuable because it turns quant AI from architecture theatre into operational diagnosis.
The boundary is equally important. These are benchmark and backtest results, not a live-trading licence. QuantBench supports better model selection, research governance, and infrastructure decisions. It does not prove that any included strategy will survive transaction costs, crowding, regime shifts, capacity limits, or the charming little habit markets have of changing once everyone notices the same pattern.
The quant model beauty contest is testing the wrong thing
A familiar AI investment story goes like this: take more data, apply a more expressive model, discover subtler patterns, harvest better returns. The plot is neat. The market, regrettably, has not agreed to act inside that plot.
QuantBench matters because it attacks the weak joint in that story. In quant investing, prediction quality is only an intermediate artefact. A model emits scores or forecasts; those scores become positions; positions become trades; trades face liquidity, cost, timing, turnover, and risk. Somewhere along that chain, a statistically attractive signal can become an unimpressive portfolio. This is not a philosophical inconvenience. It is the job.
The authors therefore organise QuantBench around the actual quant pipeline rather than around a single modelling task. The benchmark covers factor mining, modelling, end-to-end modelling, portfolio optimisation, and order execution. Each task has different inputs, outputs, objectives, feedback signals, and evaluation metrics. Factor mining outputs features and is evaluated as a signal problem. Modelling outputs predictions and is usually trained through regression, classification, or ranking. Portfolio optimisation turns predictions into positions. Execution turns desired positions into trades and is evaluated through slippage, market impact, and related costs.
That structure is the paper’s first contribution. It refuses to let “model accuracy” impersonate “investment performance”. In many AI domains, a benchmark can reward the model output directly: label accuracy, BLEU score, answer correctness, segmentation overlap, whatever today’s preferred scoreboard happens to be. Quant investing is more annoying. The useful object is a decision under uncertainty, not a prediction in isolation.
QuantBench’s second contribution is the data layer. It builds breadth across market data, fundamentals, relational data, and news, and depth across frequencies from quarterly financial statements to tick-level trades and quotes. The stock universes cover China, the US, Hong Kong, and the UK, with market data starting from roughly 2003–2006 and ending in May 2024. Relational data include Wikidata-derived links and industry graphs, with attention to temporal snapshots to avoid future-information leakage.
The third contribution is empirical: once models are compared under a more consistent pipeline, the glamorous hierarchy collapses. More expressive models do not reliably dominate. More data does not automatically help. Graphs are not magic. Loss functions matter. Validation design matters. Rolling retraining matters. Ensembling helps, but partly because the underlying signals are unstable enough to need it. A benchmark, in other words, becomes useful when it embarrasses simple narratives.
IC is not P&L, and the difference is where money goes to sulk
The paper’s most useful early comparison is deliberately plain: XGBoost versus LSTM on Chinese stock data, using two feature sets, Alpha101 and Alpha158. This is main evidence, not a decorative side test. It asks whether a deep neural network’s stronger fitting ability translates into better investment outcomes when evaluated through a ranking-based stock-selection backtest.
The answer is: not reliably.
| Feature set | Model | IC | Return | Sharpe ratio | What the result says |
|---|---|---|---|---|---|
| Alpha101 | XGBoost | 2.31% | 24.58% | 0.8093 | Lower IC, slightly better portfolio outcome |
| Alpha101 | LSTM | 4.76% | 24.25% | 0.7741 | Higher IC, slightly weaker return and Sharpe |
| Alpha158 | XGBoost | 2.53% | 20.31% | 0.6407 | Lower IC and weaker portfolio result |
| Alpha158 | LSTM | 5.95% | 23.76% | 0.7561 | Higher IC and better portfolio result |
This table is awkward in exactly the right way. With Alpha158, the LSTM wins on both prediction and portfolio metrics. With Alpha101, the LSTM still wins on IC but loses slightly on return and Sharpe. The paper interprets this as evidence that tree models may perform better when feature sets already contain strong predictive structure, possibly because trees handle tabular engineered features with less overfitting. DNNs can capture more complex relationships, but complexity is not free. It can spend signal budget on patterns that do not survive the conversion from score to portfolio.
For operators, the lesson is not “XGBoost forever”. That would be a very 2018 kind of comfort blanket. The lesson is that IC should be treated as a diagnostic, not a contract. IC measures correlation between the model’s signal and future returns. Portfolio return and Sharpe depend on position construction, concentration, drawdown behaviour, turnover, and the distribution of forecast errors where capital is actually allocated.
A model can improve average ranking quality while worsening the tails that matter to a portfolio. It can also improve prediction on many names that receive little capital while failing on the names selected into the book. It can raise IC by becoming better at weak signals across the cross-section, yet fail to improve the top tranche enough to beat a simpler model. This is where investment AI often loses money politely: not through obvious failure, but through metric misalignment.
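The divergence described above can be reproduced with a toy cross-section. The sketch below is illustrative only: the ten returns and both score vectors are invented, the helper names (`rank_ic`, `top_k_return`) are hypothetical, and the equal-weight top-2 book is an assumed stand-in for the paper's ranking-based backtest rather than its actual construction.

```python
from statistics import mean

def ranks(values):
    """0-based ranks; assumes distinct values (no tie handling in this toy)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for r, i in enumerate(order):
        out[i] = r
    return out

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def rank_ic(scores, returns):
    """Spearman-style IC: Pearson correlation of ranks across one cross-section."""
    return pearson(ranks(scores), ranks(returns))

def top_k_return(scores, returns, k):
    """Equal-weight return of the k highest-scored names."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return mean(returns[i] for i in top)

# Hypothetical realised returns (%) for ten stocks, listed worst to best.
realised = [-5, -4, -3, -2, -1, 1, 2, 3, 4, 5]
# Model A orders the bottom six correctly but swaps the two best names with the next two.
model_a = [0, 1, 2, 3, 4, 5, 8, 9, 6, 7]
# Model B scrambles the bottom eight but identifies the two best names.
model_b = [3, 2, 1, 0, 7, 6, 5, 4, 8, 9]

ic_a, ic_b = rank_ic(model_a, realised), rank_ic(model_b, realised)
ret_a, ret_b = top_k_return(model_a, realised, 2), top_k_return(model_b, realised, 2)
print(f"A: IC={ic_a:.3f}, top-2 return={ret_a:.1f}%")
print(f"B: IC={ic_b:.3f}, top-2 return={ret_b:.1f}%")
```

In this toy, Model A has the higher IC (about 0.90 against 0.76) yet the lower top-2 return (2.5% against 4.5%), because its errors sit exactly where the capital goes.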
The operational replacement is straightforward. For every candidate model, evaluate at least three layers: signal quality, portfolio behaviour, and execution sensitivity. IC answers “is there predictive association?” Return asks “does the strategy monetise it?” Sharpe asks “does it monetise it with tolerable variance?” Turnover and drawdown then ask whether the firm can live with the answer.
Newer architectures do not get a participation trophy
The broad model comparison is the paper’s architectural stress test. It uses US stock data with volume-price and fundamental features, trains models using IC loss, incorporates Wikidata relational information where relevant, and evaluates models under a shared backtesting setup.
The results are not kind to architectural fashion.
Vanilla RNNs perform well. GRU reports an IC of 4.12%, return of 61.36%, and Sharpe of 3.4433. ALSTM reports the strongest IC among the adapted RNN group at 4.22%, though its return and Sharpe remain close to the RNN family. TCN performs respectably, with IC of 3.86%, return of 54.72%, and Sharpe of 2.9729. TFT also performs strongly, with IC of 3.99%, return of 54.31%, and Sharpe of 3.2324.
Then come the Transformers. Autoformer reports IC of 0.08%, return of -8.82%, and Sharpe of -0.7467. FEDformer is similarly poor. Pyraformer and PatchTST are positive but weak relative to the better recurrent, tabular, and adaptive graph models. The point is not that Transformers are useless in finance. The point is narrower and more useful: Transformer architectures designed for time-series modelling do not automatically map well into stock prediction under this benchmark.
The graph story is also mixed. GCN performs badly, with negative IC and return. GAT performs much better, reporting IC of 3.90%, return of 58.07%, and Sharpe of 3.0730. RGCN performs reasonably, with IC of 3.78%, return of 49.58%, and Sharpe of 2.7706. Hypergraph models such as STHAN and STHGCN perform poorly in this setup. Adaptive graph models are the interesting exception: THGNN reports IC of 4.93%, return of 65.04%, and Sharpe of 3.3184.
The comparison matters because it separates three claims that are too often fused together:
| Claim | What QuantBench suggests | Business reading |
|---|---|---|
| “Temporal modelling helps.” | Often yes; RNN-family models and TCN perform competitively. | Keep sequence models in the research stack. They are not obsolete just because they are less fashionable. |
| “Transformers dominate sequence tasks.” | Not in this reported stock-prediction setting. | Do not import NLP architecture preferences as investment policy. Markets are not text with tickers. |
| “Relational structure helps.” | Sometimes; static or poorly matched graphs can fail, adaptive graphs do better. | Relationship data needs modelling discipline, not graph-shaped optimism. |
This is a comparison-based result, not a universal ranking of every model family. The benchmark setup matters: data source, stock universe, feature design, training objective, and backtest construction all shape the outcome. Still, the business implication is durable. Architecture selection should be treated as portfolio engineering, not brand selection. The question is not “which model class is advanced?” The question is “which assumptions about time, cross-sectional relationships, noise, and regime shift are being encoded, and are those assumptions useful for this book?”
Static relationships are cheap; useful relationships are learned
The paper’s relational-data findings deserve special attention because they push against an increasingly common fintech reflex: build a knowledge graph, attach a graph neural network, call it intelligence, then wait for investors to applaud.
QuantBench incorporates relational information from sources such as Wikidata and industry classifications. It also takes temporal information into account for relations, which matters because using a relationship that did not exist at the time of prediction would be leakage wearing a blazer.
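One way to picture the temporal handling is a point-in-time filter over typed edges. This is a minimal sketch under stated assumptions: the `Edge` fields, the `edges_as_of` helper, and the tickers are all hypothetical and are not taken from QuantBench's code or schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    kind: str                        # e.g. "supplier", "industry", "ownership"
    valid_from: date                 # first day the relationship was known
    valid_to: Optional[date] = None  # None means still valid

def edges_as_of(edges, as_of):
    """Keep only relationships that existed on the prediction date."""
    return [
        e for e in edges
        if e.valid_from <= as_of and (e.valid_to is None or as_of < e.valid_to)
    ]

# Hypothetical relationships between made-up tickers.
graph = [
    Edge("AAA", "BBB", "supplier", date(2019, 1, 1), date(2021, 6, 30)),
    Edge("AAA", "CCC", "industry", date(2015, 1, 1)),
    Edge("BBB", "CCC", "ownership", date(2022, 3, 1)),
]

snapshot_2020 = edges_as_of(graph, date(2020, 12, 31))
snapshot_2023 = edges_as_of(graph, date(2023, 1, 1))
print(sorted(e.kind for e in snapshot_2020))
print(sorted(e.kind for e in snapshot_2023))
```

A model trained on 2020 data with the full `graph` would see the 2022 ownership link before it existed, which is the leakage the snapshot filter is meant to prevent.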
The experiments show that simply adding graph structure does not guarantee performance improvement. Wikidata inclusion yields minimal improvements in the main model comparison, possibly because such public relationships are already known and priced. Homogeneous graph models such as GCN underperform. Relational GNNs, which distinguish edge types, do better than homogeneous graph models. Adaptive graph models, which learn latent structures from data, perform better still.
This creates a useful hierarchy for business users:
- Known static relationships can be operationally clean but may be economically stale.
- Typed relationships improve representation because supplier links, ownership links, and industry links should not be treated as the same kind of edge.
- Adaptive relationships may capture latent co-movement and changing market structure, but they also increase model-risk burden because the learned graph must be monitored, interpreted, and stress-tested.
The paper does not prove that adaptive graph models are always superior. It shows they perform strongly in the reported benchmark. The broader lesson is more interesting: relational modelling is valuable only when the representation matches the economic mechanism. A graph is not an edge list with ambition. It is a claim about how information, risk, or behaviour travels across assets.
For a quant desk, this means graph-based research should begin with a relationship thesis. Supply chains may matter for shock propagation. Ownership may matter for governance and capital flows. Industry membership may matter for factor exposure. News co-mentions may matter for attention spillovers. Latent learned edges may matter for co-movement. Each thesis needs different validation. If all of them are pushed through the same graph pipeline, the result is not sophistication. It is data plumbing with a conference badge.
Loss functions are investment policy in disguise
The training-objective experiment is an ablation-style test. It changes the learning objective while holding the broader setup comparable, using LSTM, RGCN, and DTML with classification loss, IC loss, MSE loss, and pairwise ranking loss. Its purpose is not to crown one universal loss function. It shows how strongly objective choice shapes the eventual investment metric.
The IC loss performs strongly on return and IC. For LSTM, IC loss reports 36.03% return and IC of 3.62%, while classification loss produces a lower return of 19.31% but a higher Sharpe (1.9757, versus 1.8642 for IC loss). For RGCN, IC loss reports the highest return at 44.97% and IC of 3.97%. For DTML, IC loss again reports the highest return at 38.56% and IC of 3.24%. MSE and ranking losses underperform badly in several cases, with ranking even producing negative returns for all three reported models.
This is not just a technical detail. A loss function is a compressed statement of what the organisation rewards during training. MSE asks the model to approximate return values. Classification asks it to get direction right. Ranking asks it to order assets pairwise. IC loss aligns more directly with the cross-sectional signal metric used in quant selection. Utility maximisation, not the focus of this specific table, would move still closer to portfolio objectives.
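To make the differences concrete, here is a sketch of the four objective types in plain Python. These are common textbook forms, not the paper's exact formulations: the function names, the hinge variant of the pairwise loss, and the toy numbers are all assumptions chosen for illustration.

```python
from statistics import mean

def mse_loss(pred, target):
    """Penalises distance between predicted and realised return values."""
    return mean((p - t) ** 2 for p, t in zip(pred, target))

def direction_error(pred, target):
    """Classification-style view: share of assets with the wrong predicted sign."""
    return mean(1.0 if (p > 0) != (t > 0) else 0.0 for p, t in zip(pred, target))

def pairwise_hinge_loss(pred, target, margin=0.0):
    """Mean hinge penalty over pairs (i, j) where asset i truly beat asset j."""
    penalties = []
    for i in range(len(pred)):
        for j in range(len(pred)):
            if target[i] > target[j]:
                penalties.append(max(0.0, margin - (pred[i] - pred[j])))
    return mean(penalties)

def ic_loss(pred, target):
    """Negative Pearson correlation across one cross-section (lower is better)."""
    mp, mt = mean(pred), mean(target)
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_t = sum((t - mt) ** 2 for t in target)
    return -cov / (var_p * var_t) ** 0.5

# Hypothetical cross-section: the forecast has the right ordering but is 10x too large.
realised = [0.02, 0.01, -0.01, -0.02]
forecast = [0.20, 0.10, -0.10, -0.20]

print("MSE            ", mse_loss(forecast, realised))
print("direction error", direction_error(forecast, realised))
print("pairwise hinge ", pairwise_hinge_loss(forecast, realised))
print("IC loss        ", ic_loss(forecast, realised))
```

The same forecast is flawless under the direction, ranking, and IC views and still pays an MSE penalty for its scale. Which of those verdicts matters depends on whether the downstream book uses magnitudes or only the ordering.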
When teams choose loss functions casually, they are not making a harmless engineering choice. They are choosing which mistakes the model is allowed to care about. In a top-ranked stock-selection strategy, getting the middle of the cross-section slightly better may not matter much. Misranking the names that enter the portfolio matters. In a risk-constrained strategy, return prediction without volatility awareness can be less useful than a weaker signal with better drawdown behaviour.
The paper’s business message is therefore blunt: model governance should include objective governance. Every research report should state not only which architecture was used, but why the loss function matches the investment decision. If that paragraph is missing, the model may still work. So may a vending machine kicked at the correct angle. Neither should be called a process.
Alpha decay turns model maintenance into a first-class cost
The alpha-decay experiment is a sensitivity and operational-maintenance test. It evaluates walk-forward updating schemes with rolling steps of 3, 6, and 12 months, plus a no-rolling baseline, using US stock data from 2021 to 2023 and the S&P 500 as benchmark.
The visual result is clear: the 3-month rolling scheme performs best, while the no-rolling approach performs worst. More frequent retraining helps address distribution shift and slows alpha decay. The catch is also clear: more frequent retraining increases compute and operational cost.
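The cost side of that trade-off is easy to count. The sketch below builds walk-forward schedules for the four schemes; it assumes an expanding training window and month-indexed data, neither of which is confirmed by the paper, and `walk_forward_windows` is a hypothetical helper.

```python
def walk_forward_windows(n_months, first_test_month, step):
    """Return (train_end, test_end) pairs, end-exclusive, in month indices.

    Each model trains on months [0, train_end) and is used on [train_end, test_end).
    step=None means train once and never refresh (the no-rolling baseline).
    """
    if step is None:
        return [(first_test_month, n_months)]
    windows = []
    start = first_test_month
    while start < n_months:
        end = min(start + step, n_months)
        windows.append((start, end))
        start = end
    return windows

# Assumed layout: 24 months of history, then a 36-month test period.
N_MONTHS, FIRST_TEST = 60, 24
schedules = {
    "3-month": walk_forward_windows(N_MONTHS, FIRST_TEST, 3),
    "6-month": walk_forward_windows(N_MONTHS, FIRST_TEST, 6),
    "12-month": walk_forward_windows(N_MONTHS, FIRST_TEST, 12),
    "no rolling": walk_forward_windows(N_MONTHS, FIRST_TEST, None),
}
for name, windows in schedules.items():
    print(f"{name:>10}: {len(windows)} training runs, first window {windows[0]}")
```

Over a 36-month test period the 3-month scheme needs twelve training runs against one for the no-rolling baseline, and each run also implies a validation and deployment cycle.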
This is where the paper becomes directly useful to managers rather than just researchers. In production, a model is not a one-off artefact. It is an asset with maintenance capex. If alpha decays quickly, then retraining cadence is not a back-office scheduling issue. It becomes part of the investment strategy.
The trade-off can be stated simply:
| Decision | Benefit | Cost | Governance question |
|---|---|---|---|
| Frequent rolling retraining | Slower decay; better adaptation to market evolution | More compute, more validation, more deployment risk | What is the marginal performance gain per retraining cycle? |
| Infrequent retraining | Lower infrastructure and review burden | Faster model staleness | How quickly does the signal decay after deployment? |
| No rolling update | Operational simplicity | Worst reported performance in this test | Is simplicity being confused with stability? |
The paper points toward continual learning and online learning as future research directions. Cognaptus’ business inference is narrower: firms need an alpha-maintenance budget. Every AI strategy should specify how often the model is expected to degrade, how that decay is monitored, what triggers retraining, and who is accountable for approving the refreshed model.
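A retraining trigger of that kind can be very small. The following is a sketch of one possible rule, a trailing-mean IC floor; the `DecayMonitor` class, the window length, and the threshold are hypothetical choices for illustration and do not come from the paper.

```python
from collections import deque
from statistics import mean

class DecayMonitor:
    """Flags a retraining review when the trailing mean IC drops below a floor."""

    def __init__(self, window, floor):
        self.window = window
        self.floor = floor
        self.recent = deque(maxlen=window)

    def update(self, period_ic):
        """Record one period's live IC; return True if the trigger fires."""
        self.recent.append(period_ic)
        window_full = len(self.recent) == self.window
        return window_full and mean(self.recent) < self.floor

# Hypothetical monthly live ICs for a signal that fades after deployment.
live_ics = [0.05, 0.04, 0.04, 0.03, 0.02, 0.01, 0.00, -0.01]

monitor = DecayMonitor(window=3, floor=0.025)
flags = [monitor.update(ic) for ic in live_ics]
first_trigger = flags.index(True)
print(f"first trigger in month {first_trigger + 1}")
```

The rule itself matters less than the fact that it is written down: the window, the floor, and the person who acts on the flag are all governance decisions.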
Without that, “AI-driven investing” becomes a strange ritual: train once, deploy proudly, and watch the market quietly walk away.
Validation design is not admin work; it is model risk
The hyperparameter-tuning experiment is a robustness and implementation test. It compares validation-set construction methods: tail, random, and fragmented. It also compares normal training against retraining on the full dataset after hyperparameters are selected.
The result complicates the usual assumption that the validation segment immediately before the test set is always best. Random validation outperforms tail validation in both normal and retrain settings. Under normal training, random validation reports IC of 3.86%, return of 36.84%, and Sharpe of 1.8969, compared with tail validation at IC of 3.29%, return of 29.87%, and Sharpe of 1.5454. Under retraining, random validation again leads, with return of 41.36% and Sharpe of 1.9334.
Fragmented validation performs poorly under normal training but improves under retraining. The authors suggest that fragmented validation may produce more stable hyperparameters, though it needs further refinement.
The operational lesson is that validation is not a checkbox. In non-stationary markets, validation design encodes assumptions about which historical conditions matter for future deployment. A tail validation set assumes the most recent past is most representative. A random validation set exposes the model to more diverse regimes. A fragmented design tries to sample across history while preserving some temporal spread. Each choice is a market hypothesis.
A good quant review should therefore ask:
| Validation choice | Hidden assumption | Failure mode |
|---|---|---|
| Tail validation | Recent history best represents near future | Overfits to the latest regime |
| Random validation | Diverse past patterns improve hyperparameter choice | Can blur temporal dependence if used carelessly |
| Fragmented validation | Stability comes from sampled historical fragments | May underperform unless retraining and segmentation are designed well |
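The three designs can be sketched as index selections over a training history. The paper's exact constructions are not described here, so the definitions below are assumptions: in particular, `fragmented_split` treats "fragmented" as evenly spaced contiguous blocks, which is only one plausible reading.

```python
import random

def tail_split(n, n_val):
    """Validation = the last n_val periods of the history."""
    return list(range(n - n_val, n))

def random_split(n, n_val, seed=0):
    """Validation = n_val periods drawn uniformly at random from the history."""
    return sorted(random.Random(seed).sample(range(n), n_val))

def fragmented_split(n, n_val, n_fragments):
    """Validation = n_fragments contiguous blocks spaced evenly across the history."""
    block = n_val // n_fragments
    stride = n // n_fragments
    indices = []
    for k in range(n_fragments):
        start = k * stride + (stride - block)  # block closes out each stride
        indices.extend(range(start, start + block))
    return indices

# Assumed layout: 120 months of history, 24 of them held out for validation.
N, N_VAL = 120, 24
tail_val = tail_split(N, N_VAL)
random_val = random_split(N, N_VAL)
frag_val = fragmented_split(N, N_VAL, n_fragments=4)

for name, val in [("tail", tail_val), ("random", random_val), ("fragmented", frag_val)]:
    train = [i for i in range(N) if i not in set(val)]
    print(f"{name:>10}: {len(train)} train / {len(val)} val, val starts at month {val[0]}")
```

All three hold out the same number of periods; what changes is which regimes the hyperparameters are tuned against, which is the hidden assumption the table makes explicit.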
The paper does not settle the validation problem. It usefully makes it harder to ignore. That is enough.
Ensembles help because the signal is fragile, not because averaging is magic
The ensemble experiment is a robustness test against noisy optimisation. The authors train an MLP-Mixer model over 40 repeated runs with different random seeds on US volume-price data, average the predictions, and backtest the ensemble. Figure 5 shows substantial variance across runs, while the ensemble reduces variance and improves robustness.
The appendix extends this logic by showing low prediction correlation among different models on CSI300, suggesting potential for ensembling across model families as well as across repeated runs of the same architecture.
The mechanism is intuitive. In a low signal-to-noise environment, different random seeds can push the same model toward different noisy patterns. Averaging predictions can cancel some idiosyncratic errors. But this should not be oversold. Ensembling reduces variance; it does not manufacture economic signal from nothing. It can make a fragile process less fragile. It cannot make a bad thesis good.
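The averaging mechanism can be simulated directly. This sketch is a synthetic stand-in, not a reproduction of the paper's MLP-Mixer experiment: each "run" is the same hypothetical signal plus independent seed-specific noise, and the asset count and noise level are arbitrary assumptions. Only the count of 40 runs mirrors the paper.

```python
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

N_ASSETS, N_SEEDS, NOISE_SD = 500, 40, 5.0

# Hypothetical "true" component that every run is trying to recover.
signal_rng = random.Random(0)
signal = [signal_rng.gauss(0.0, 1.0) for _ in range(N_ASSETS)]

# Each seed recovers the signal buried under its own independent noise.
runs = []
for seed in range(N_SEEDS):
    run_rng = random.Random(1000 + seed)
    runs.append([s + run_rng.gauss(0.0, NOISE_SD) for s in signal])

# Ensemble prediction = per-asset average across seeds.
ensemble = [sum(col) / N_SEEDS for col in zip(*runs)]

single_ics = [pearson(run, signal) for run in runs]
ensemble_ic = pearson(ensemble, signal)
print(f"single runs: min {min(single_ics):.3f}, max {max(single_ics):.3f}")
print(f"ensemble:    {ensemble_ic:.3f}")
```

The ensemble beats every individual run here only because a shared signal exists underneath the noise. Replace `signal` with zeros inside the runs and averaging has nothing to recover.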
For investment operations, ensembling has three uses:
| Use | Practical value | Boundary |
|---|---|---|
| Same-model repeated runs | Reduces seed sensitivity | May hide instability rather than solve it |
| Cross-model ensembles | Captures diverse views when predictions are weakly correlated | Requires correlation monitoring and allocation logic |
| Strategy-level ensembles | Diversifies across signals or books | Can increase complexity, turnover, and governance burden |
The best interpretation is that ensemble performance is a diagnostic. If a single model varies wildly by seed, the team has learned something uncomfortable about the signal environment. The ensemble may be useful, but the variance itself should trigger investigation. Robustness should not be outsourced to averaging and a hopeful spreadsheet.
More data helps only after it survives integration
The appendix experiment on information sources is an exploratory extension, not the main thesis. It asks whether adding volume-price (VP), fundamentals (F), news (N), industry (I), and Wikidata (W) improves model performance for a DNN and XGBoost on US stock data.
The answer: yes, but not monotonically, and not regardless of representation.
For the DNN, raw volume-price data alone produces IC of 3.43% but negative return and Sharpe. Adding fundamentals improves return sharply to 30.89% and Sharpe to 1.5592. Adding news improves IC to 4.07%, return to 31.97%, and Sharpe to 1.6825. But adding industry as graph information reduces IC to 1.97%, return to 13.51%, and Sharpe to 0.9173. Adding Wikidata gives IC of 3.80% but return of 18.06%, below the VPFN result.
For XGBoost, VP alone is weak and negative. Adding fundamentals improves IC but return remains negative. Adding industry finally produces a positive return of 1.74% and Sharpe of 0.0833, still modest. The VPFNW setting is not reported for XGBoost.
This appendix result is useful because it cuts against another easy belief: “just add alternative data”. Data breadth helps only if the model can ingest the information in a form that preserves economic meaning. Fundamentals and news appear useful in the reported setting. Graph-structured industry information hurts the DNN result, which the authors attribute to limitations of RGCN in that context. With Wikidata, IC is 3.80%, close to the VPFN level, but return is only 18.06%, far below it, again reminding us that prediction metrics and monetisation can diverge.
The business inference is not “buy less data”. It is “test data integration as a model component”. A new dataset should be evaluated through incremental contribution, representation choice, leakage controls, stability, turnover impact, and downstream portfolio value. Otherwise, the firm may simply be purchasing more expensive noise.
What each experiment supports, and what it does not
A benchmark paper can easily become a buffet of tables. Useful, but dangerous if readers start turning every row into a law of markets. The experiments in QuantBench have different roles.
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| XGBoost vs LSTM on Alpha101/Alpha158 | Main comparative evidence | Better IC does not always mean better portfolio metrics; engineered features and model class interact | Trees always beat neural networks |
| Broad architecture comparison | Main evidence across model families | RNNs, tabular models, and adaptive graph models can outperform more fashionable architectures in this setup | A permanent ranking of architectures |
| Training-objective comparison | Ablation-style objective test | Loss choice materially affects return, Sharpe, and IC | IC loss is universally optimal |
| Rolling retraining schemes | Sensitivity and maintenance test | Frequent updates reduce alpha decay in the reported period | Three-month retraining is always the right cadence |
| Validation-set selection | Robustness and implementation test | Validation design changes model outcomes and should be governed | Random validation is always superior |
| Ensemble experiment | Robustness test | Averaging reduces variance from noisy model fits | Ensembling solves weak signals |
| Model-correlation appendix | Exploratory ensemble rationale | Low model correlation suggests diversification potential | Any low-correlation model improves portfolio quality |
| Information-source appendix | Exploratory data-integration extension | More data can help, but representation matters | Alternative or relational data automatically improve returns |
| Dataset and limitation appendices | Implementation detail and boundary setting | Coverage is broad but still incomplete; backtesting realism can improve | The benchmark captures every real-world trading constraint |
This table is the sober reading. Less fun than declaring a winner, but substantially cheaper than discovering the boundary in production.
The business value is an evaluation operating system
QuantBench’s business relevance is not that a CIO can read the paper, pick THGNN, retrain every three months, add fundamentals and news, ensemble everything, and then go home early. That would be convenient. Markets dislike convenience.
The value is that QuantBench provides a structure for asking better questions:
- Does the chosen training objective match the investment decision?
- Does predictive improvement survive portfolio construction?
- Does the model degrade under market evolution, and how fast?
- Does validation design reflect plausible deployment conditions?
- Does relational data add information or merely decorate the feature space?
- Does more data improve returns after integration costs and noise?
- Does ensembling reduce variance because models are genuinely diverse, or because individual runs are unstable?
- Are evaluation metrics task-specific enough for the actual use case?
For asset managers, this supports model selection and research governance. For fintech platforms, it supports product claims that are less embarrassing under due diligence. For AI-investing vendors, it provides a way to demonstrate process quality without pretending that a backtest is a prophecy. For internal quant teams, it helps allocate R&D budget: spend less time arguing about architecture prestige, more time measuring decay, objective alignment, validation stability, and data contribution.
A useful operational framework would look like this:
| Operating layer | QuantBench lesson | Business action |
|---|---|---|
| Data | Breadth and depth matter, but integration matters more | Run incremental data-value tests before scaling vendor spend |
| Model | Architecture rankings are context-dependent | Maintain a model zoo, but select by portfolio outcome |
| Objective | Loss functions shape investment behaviour | Require objective-to-strategy justification in model review |
| Validation | Sampling design changes conclusions | Treat validation design as model-risk policy |
| Deployment | Alpha decays under distribution shift | Define retraining cadence and monitoring triggers |
| Robustness | Low signal-to-noise creates seed and model variance | Use ensembles, but investigate instability rather than hiding it |
| Evaluation | Signal, portfolio, and execution metrics differ | Report the full chain, not just IC |
This is dull in the way good infrastructure is dull. It is also where most of the money is.
The limits are real, and they matter operationally
The paper is careful about future work. QuantBench still needs broader alternative data, deeper granularity such as order flow, more model architectures, more innovative formulations, and more realistic and efficient backtesting. These are not cosmetic gaps.
Order flow matters because many strategies die between portfolio intent and trade execution. Backtesting realism matters because slippage, liquidity, market impact, commissions, borrow constraints, and capacity can change rankings across strategies. Alternative data matters because some signals are not visible in standard volume-price or fundamental fields. Newer architectures matter because the model universe is moving quickly, and today’s benchmark suite will not stay complete for long.
There is also a broader interpretation boundary. The empirical results are based on selected markets, datasets, feature sets, periods, and backtest assumptions. That makes them useful for comparative diagnosis, not for universal investment claims. A model that underperforms in QuantBench may still work in a specialised strategy with different data, costs, horizons, constraints, or execution logic. A model that performs well in QuantBench may fail live once capacity, crowding, latency, and regime change arrive with their usual lack of manners.
So the right conclusion is not that QuantBench has solved quant AI evaluation. It has made the evaluation problem harder to evade. That is progress.
Conclusion: the benchmark is the product
The most important part of QuantBench is not the winning model. It is the benchmark discipline that makes “winning” a more meaningful word.
The paper’s comparisons repeatedly undermine the easy story that quant AI advances by stacking more expressive models on wider datasets. DNNs can improve IC without improving portfolio outcomes. Transformers can underperform older sequence models. Graphs can help or harm depending on representation. More information can improve results, then damage them when integrated poorly. Frequent retraining can slow alpha decay, but it adds cost. Ensembles can reduce variance, but only because the underlying environment is noisy enough to make variance a problem in the first place.
For business users, this turns QuantBench into an operating philosophy. Evaluate the whole chain. Match objectives to decisions. Treat data integration as a model-design problem. Monitor decay. Govern validation. Use ensembles as robustness tools, not as superstition. Above all, stop confusing architecture novelty with investment edge.
Trees still matter. Deep roots still matter. But in quant AI, the forest is the benchmark.
Cognaptus: Automate the Present, Incubate the Future.
[1] Saizhuo Wang, Hao Kong, Jiadong Guo, Fengrui Hua, Yiyan Qi, Wanyun Zhou, Jiahao Zheng, Xinyu Wang, Lionel M. Ni, and Jian Guo, “QuantBench: Benchmarking AI Methods for Quantitative Investment,” arXiv:2504.18600, 2025, https://arxiv.org/abs/2504.18600.