In the age of Transformers and neural nets that write poetry, it’s tempting to assume deep learning dominates every corner of AI. But in quantitative investing, the roots tell a different story. A recent paper, QuantBench: Benchmarking AI Methods for Quantitative Investment[^1], delivers a grounded reminder: tree-based models still outperform deep learning (DL) methods across key financial prediction tasks.
XGBoost: Still the Evergreen in the Quant Forest
Let’s start with the basics. Tree-based models like XGBoost (Extreme Gradient Boosting) work by building many decision trees and combining their outputs. Each tree is a set of yes/no questions—“Is the stock’s 7-day return > 2%?”—and each new tree learns to correct the mistakes of the previous ones.
Mathematically, this is an ensemble method that minimizes the loss by functional gradient descent over additive tree functions: each new tree is fit to correct the residual errors (roughly, the negative gradient of the loss) of the ensemble built so far.
$$ \min_{F_m} \sum_{i=1}^n L(y_i, F_{m-1}(x_i) + f_m(x_i)) $$
Where $F_m$ is the ensemble after $m$ steps, $f_m$ is the new tree added at step $m$, and $L$ is the loss function.
In practical terms? Think of it as a wise committee where each new member only speaks up when earlier ones got it wrong. It’s efficient, robust to noisy features, and handles tabular data like financial time series superbly.
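To make this concrete, here is a minimal, hypothetical sketch that fits an XGBoost regressor on a toy panel of lagged-return features. The data, feature names, and hyperparameters are illustrative placeholders, not the QuantBench configuration.

```python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

# Toy panel: predict next-day return from a few lagged-return features.
# Features and hyperparameters are illustrative, not from QuantBench.
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0, 0.01, 1000))
X = pd.DataFrame({
    "ret_1d": returns.shift(1),
    "ret_5d": returns.rolling(5).sum().shift(1),
    "vol_20d": returns.rolling(20).std().shift(1),
}).dropna()
y = returns.loc[X.index]  # target return, aligned with lagged features

model = XGBRegressor(
    n_estimators=300,    # number of boosting rounds (trees)
    max_depth=4,         # shallow trees resist overfitting noisy signals
    learning_rate=0.05,  # shrinkage on each tree's contribution
    subsample=0.8,       # row subsampling adds robustness
)
model.fit(X.iloc[:800], y.iloc[:800])   # train on the first 800 observations
preds = model.predict(X.iloc[800:])     # out-of-sample predictions
```

Shallow trees plus shrinkage is exactly the “committee that corrects its own mistakes” intuition above.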

Figure 1: Overview of QuantBench architecture and pipeline
What About Deep Learning?
Deep Learning sounds flashier—and in many fields, it is. It includes:
- RNNs (Recurrent Neural Networks): These models maintain a hidden state that updates at each timestep, allowing them to remember past inputs. This makes them suited to sequential data, though they often struggle with long-term dependencies and noisy financial series. Mathematically, RNNs compute $h_t = \sigma(W h_{t-1} + U x_t + b)$, where $h_t$ is the hidden state and $x_t$ is the input at time $t$.
- GNNs (Graph Neural Networks): GNNs operate on graph-structured data by aggregating and transforming information from a node’s neighbors, enabling models to learn from relational structures like stock co-movement graphs. At each layer, node $v$’s representation is updated via $h_v' = \text{ReLU}\left(W \cdot \text{AGG}(\{h_u \mid u \in N(v)\})\right)$.
- Transformers: They rely on self-attention, which lets every input element weigh every other input element, capturing long-range dependencies without recurrence. Self-attention is defined as $\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, which excels at capturing global context (a minimal sketch follows this list).
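To ground these formulas, here is a minimal NumPy sketch of a vanilla RNN update and single-head scaled dot-product attention. It illustrates the equations only; it is not QuantBench code, and $\tanh$ stands in for the generic activation $\sigma$.

```python
import numpy as np

def rnn_step(h_prev, x_t, W, U, b):
    """One vanilla RNN update: h_t = sigma(W h_{t-1} + U x_t + b), with tanh as sigma."""
    return np.tanh(W @ h_prev + U @ x_t + b)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V, no masking."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])               # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                     # attention-weighted values
```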
So why do these models often underperform in finance? Because markets are noisy, low-signal environments where generalization matters more than abstraction. Tree models like XGBoost remain resilient where DL can easily overfit.

Figure 2: Data processing pipeline supported by QuantBench
What QuantBench Benchmarked
QuantBench compares over 10 models across four stock markets (US, CN, HK, UK) and multiple tasks:
- Return prediction: forecasting next-day or next-week stock movement.
- Risk-adjusted return: evaluating returns relative to the risk taken, e.g., via the Sharpe ratio (defined just below).
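For reference, the Sharpe ratio divides expected excess return by its volatility:

$$ \text{Sharpe} = \frac{\mathbb{E}[R_p - R_f]}{\sigma_p} $$

where $R_p$ is the portfolio return, $R_f$ the risk-free rate, and $\sigma_p$ the standard deviation of the excess return.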
It evaluates raw integration (just concatenating features) versus graph-based integration (e.g., building a knowledge graph from news or firm data). The latter is more structured but also harder to get right.
A knowledge graph is a structured network where entities (e.g., firms, events) are nodes and relationships (e.g., sector affiliation, supply chains, or joint ventures) are edges. Mathematically, it is represented as $G = (V, E)$, with a feature matrix $X \in \mathbb{R}^{|V| \times d}$ and an adjacency matrix $A \in \{0,1\}^{|V| \times |V|}$. Information can be propagated via layers such as $H^{(l+1)} = \sigma(AH^{(l)}W^{(l)})$. For example, if company A is a supplier of company B, their link might allow the model to infer risk contagion or supply shocks from A to B.
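As a toy illustration of that propagation rule, the sketch below applies one graph-convolution layer to a tiny supplier graph. The self-loop and degree normalization are a common convention (the text’s plain $A$ also works), and the numbers are placeholders, not the paper’s setup.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One propagation step H' = ReLU(Â H W), with Â the self-loop-augmented,
    symmetrically degree-normalized adjacency matrix."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W, 0.0)

# Tiny example: firm 0 supplies firm 1, firm 1 supplies firm 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.random.default_rng(0).normal(size=(3, 4))   # 3 firms, 4 features each
W = np.random.default_rng(1).normal(size=(4, 8))   # project to 8 hidden dims
H_next = gcn_layer(A, H, W)                        # shape (3, 8)
```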
In many cases, raw methods + tree models still win.

Figure 3: Model landscape and evolution in QuantBench
Summary of Temporal and Spatial Models in QuantBench
| Category | Model Name | Reference |
|---|---|---|
| Tree-based | XGBoost, LightGBM, CatBoost | Chen & Guestrin (2016); Ke et al. (2017); Prokhorenkova et al. (2018) |
| RNN-based | LSTM, SFM, DA-RNN, Hawkes-GRU | Hochreiter & Schmidhuber (1997); Zhang et al. (2017); Qin et al. (2017); Sawhney et al. (2021a) |
| CNN/MLP | TCN, MLP-Mixer | Bai et al. (2018); Tolstikhin et al. (2021) |
| Transformer | Informer, Autoformer, FEDformer, PatchTST | Zhou et al. (2021); Wu et al. (2022); Zhou et al. (2022); Nie et al. (2022) |
| GNN | GCN, GAT | Kipf & Welling (2017); Velickovic et al. (2018) |
| Hetero-GNN | RGCN, RSR | Schlichtkrull et al. (2018); Feng et al. (2019) |
| Hypergraph | ESTIMATE, STHCN, STHAN | Huynh et al. (2022); Sawhney et al. (2020, 2021b) |
Backtesting Isn’t Enough
QuantBench critiques the overreliance on simplistic backtesting setups. You can have a model that looks great in historical returns but fails catastrophically in live trading.

Figure 4: Comparison of different rolling schemes used in evaluation
A toy backtest setup might look like:
```python
for t in range(train_end, test_end):
    prediction = model.predict(X[t - lookback:t])
    pnl[t] = prediction * returns[t]  # assumes full position with no market impact
```
This is too simplistic. A more realistic approach would include rolling retraining, transaction costs, and delayed signal execution:
```python
for window in rolling_windows:
    model.fit(X[window.train], y[window.train])   # retrain on each rolling window
    preds = model.predict(X[window.test])
    for i, t in enumerate(window.test):
        # simulate_execution / estimate_transaction_cost are placeholder helpers
        exec_price = simulate_execution(preds[i], market_data[t], delay=1)
        cost = estimate_transaction_cost(exec_price, market_conditions[t])
        pnl[t] = (exec_price - market_data[t]['open']) * position_size - cost
```
Still, this omits market impact, latency, microstructure effects, and assumes cost functions that may not scale. QuantBench urges backtests to reflect real-world conditions: signal delay, sector constraints, transaction costs, and rolling portfolio effects.
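For instance, a cost model that does scale with trade size could combine a half-spread charge with a square-root impact term. The function below is a hypothetical sketch: the name, functional form, and coefficients are illustrative assumptions, not QuantBench’s cost model.

```python
import numpy as np

def sqrt_impact_cost(trade_notional, adv, daily_vol,
                     spread_bps=5.0, impact_coeff=0.1):
    """Hypothetical cost model: half the quoted spread plus square-root impact.

    trade_notional: absolute dollar value traded; adv: average daily dollar
    volume; daily_vol: daily return volatility. Coefficients are not calibrated.
    """
    half_spread = trade_notional * spread_bps / 2e4   # spread cost in dollars
    participation = trade_notional / adv              # fraction of daily volume
    impact = impact_coeff * daily_vol * np.sqrt(participation) * trade_notional
    return half_spread + impact
```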
This aligns with what we emphasized in Agents in Formation: Finetune Meets Finestructure in Quant AI—finance is not just another benchmark; it’s a battleground where generalization is alpha.
That is, your model isn’t just solving a task—it’s competing in a dynamic, adversarial ecosystem. Like a chess engine playing against other engines, not a static puzzle. In this context, robustness beats elegance. Reusability beats sophistication.
Why This Still Matters in an Agentic World
You might wonder: if LLM-based agents are the future, why care about whether trees still beat DL?
Because even agentic AI, as covered in From GenAI to Agentic AI[^2] and Agentic Agents: A Comprehensive Survey[^3], relies on sound model choices beneath the agent’s planning and memory layers.
An AI agent that recommends trades or rebalances a portfolio still needs accurate signals at the base. And if that signal comes from an XGBoost forest instead of a 12-layer Transformer, so be it.
Tree-based and DL models are domain-specific intelligence components within a broader agentic framework. Just as a robotic arm needs a reliable gripper, an agentic system needs dependable submodels. We shouldn’t override domain-specific reliability with fashionable architectures unless the upgrade is empirically better.
The Road Ahead: Hybrid Minds, Smarter Bets
None of this is to say deep learning is useless. It shines when fusing image, text, and graph data. But tree-based methods remain the quantitative backbone—and smart agentic systems will know when to delegate.

Figure 5: Ensemble learning curve with variance bands under different rolling settings
As argued in Overqualified, Underprepared, reasoning alone won’t save your portfolio. Your model—whether a language agent or a decision tree—needs to know what matters.
References
Chen, T. and Guestrin, C., 2016. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD, pp.785–794.
Ke, G. et al., 2017. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30.
Prokhorenkova, L. et al., 2018. CatBoost: unbiased boosting with categorical features. Advances in Neural Information Processing Systems, 31.
Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9(8), pp.1735–1780.
Zhou, H. et al., 2021. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. AAAI, 35(12).
Wu, H. et al., 2022. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. arXiv preprint.
Nie, Y. et al., 2022. PatchTST: A Time Series is Worth 64 Words. arXiv preprint.
Kipf, T.N. and Welling, M., 2017. Semi-supervised classification with graph convolutional networks. arXiv preprint.
Velickovic, P. et al., 2018. Graph Attention Networks. arXiv preprint.
Schlichtkrull, M. et al., 2018. Modeling relational data with graph convolutional networks. The Semantic Web.
Feng, F. et al., 2019. Temporal relational ranking for stock prediction. ACM Transactions on Information Systems, 37(2).
Huynh, T.T. et al., 2022. Efficient integration of multi-order dynamics and internal dynamics in stock movement prediction. arXiv preprint.
Sawhney, R. et al., 2020. Spatiotemporal hypergraph convolution network for stock movement forecasting. ICDM.
Sawhney, R. et al., 2021. Stock selection via spatiotemporal hypergraph attention network. AAAI, 35(1).
[^1]: QuantBench: Benchmarking AI Methods for Quantitative Investment. https://arxiv.org/abs/2504.18600
[^2]: From GenAI to Agentic AI: Capabilities, Components, and Challenges. https://arxiv.org/abs/2504.18875
[^3]: Agentic Agents: A Comprehensive Survey of LLM-based Agents. https://arxiv.org/abs/2504.19678