Branching Out, Beating Down: Why Trees Still Outgrow Deep Roots in Quant AI

TL;DR for operators

QuantBench is not another paper asking investors to believe that the newest neural architecture will finally decode markets because it has more layers and a nicer diagram. Mercifully. It is a benchmark platform for quantitative investment that tries to evaluate AI methods across the full quant workflow: factor mining, modelling, end-to-end position generation, portfolio optimisation, and order execution.¹

The practical finding is sharper than the benchmark packaging. Deep neural networks often fit prediction targets better, especially on Information Coefficient (IC), but that does not reliably convert into better returns or Sharpe ratios. In one direct comparison, LSTM beats XGBoost on IC for both Alpha101 and Alpha158 feature sets, yet XGBoost slightly wins on return and Sharpe under Alpha101. The model with the prettier predictive score is not automatically the model that makes the portfolio behave.

The same pattern repeats elsewhere. Transformers underperform badly in the reported stock-prediction benchmark. Some graph models fail when the graph structure is not a good fit for the task. Adaptive graph models do better, suggesting that relationships between stocks matter, but only when the model learns useful relationships rather than worshipping a static data structure. Adding more information helps only when the integration method is sensible. Fundamentals and news improve performance; graph-encoded industry data and Wikidata weaken return, and in the Wikidata case IC holds up far better than the return does.

For investment firms and fintech builders, the business lesson is not “use trees” or “avoid deep learning”. That would be too easy, and therefore suspicious. The lesson is: benchmark the entire decision chain. Evaluate losses against portfolio outcomes. Monitor alpha decay. Treat validation design as a live source of model risk. Use ensembling to reduce variance, but do not mistake averaging for robustness engineering. QuantBench is valuable because it turns quant AI from architecture theatre into operational diagnosis.

The boundary is equally important. These are benchmark and backtest results, not a live-trading licence. QuantBench supports better model selection, research governance, and infrastructure decisions. It does not prove that any included strategy will survive transaction costs, crowding, regime shifts, capacity limits, or the charming little habit markets have of changing once everyone notices the same pattern.

The quant model beauty contest is testing the wrong thing

A familiar AI investment story goes like this: take more data, apply a more expressive model, discover subtler patterns, harvest better returns. The plot is neat. The market, regrettably, has not agreed to act inside that plot.

QuantBench matters because it attacks the weak joint in that story. In quant investing, prediction quality is only an intermediate artefact. A model emits scores or forecasts; those scores become positions; positions become trades; trades face liquidity, cost, timing, turnover, and risk. Somewhere along that chain, a statistically attractive signal can become an unimpressive portfolio. This is not a philosophical inconvenience. It is the job.

The authors therefore organise QuantBench around the actual quant pipeline rather than around a single modelling task. The benchmark covers factor mining, modelling, end-to-end modelling, portfolio optimisation, and order execution. Each task has different inputs, outputs, objectives, feedback signals, and evaluation metrics. Factor mining outputs features and is evaluated as a signal problem. Modelling outputs predictions and is usually trained through regression, classification, or ranking. Portfolio optimisation turns predictions into positions. Execution turns desired positions into trades and is evaluated through slippage, market impact, and related costs.

That structure is the paper’s first contribution. It refuses to let “model accuracy” impersonate “investment performance”. In many AI domains, a benchmark can reward the model output directly: label accuracy, BLEU score, answer correctness, segmentation overlap, whatever today’s preferred scoreboard happens to be. Quant investing is more annoying. The useful object is a decision under uncertainty, not a prediction in isolation.
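The chain described above, from score to position to trade to cost, can be sketched in a few lines. This is a minimal illustration, not QuantBench code: it assumes an equal-weight, long-only top-k book and a flat proportional trading cost, and every name in it (`top_k_weights`, `turnover`, `net_return`, `cost_rate`, the tickers and numbers) is hypothetical.

```python
def top_k_weights(scores, k):
    """Equal-weight the k highest-scoring assets; everything else gets zero."""
    chosen = sorted(scores, key=scores.get, reverse=True)[:k]
    return {name: (1.0 / k if name in chosen else 0.0) for name in scores}


def turnover(old_w, new_w):
    """One-way turnover: half the sum of absolute weight changes."""
    names = set(old_w) | set(new_w)
    return 0.5 * sum(abs(new_w.get(n, 0.0) - old_w.get(n, 0.0)) for n in names)


def net_return(weights, asset_returns, traded, cost_rate):
    """Gross portfolio return minus a proportional cost on traded notional."""
    gross = sum(w * asset_returns[n] for n, w in weights.items())
    return gross - cost_rate * traded


# Hypothetical scores on two consecutive dates and realised returns on the second.
scores_t0 = {"A": 0.9, "B": 0.5, "C": 0.1, "D": -0.3}
scores_t1 = {"A": 0.2, "B": 0.8, "C": 0.7, "D": -0.1}
returns_t1 = {"A": 0.01, "B": 0.02, "C": -0.01, "D": 0.00}

w0 = top_k_weights(scores_t0, k=2)   # holds A and B
w1 = top_k_weights(scores_t1, k=2)   # holds B and C
traded = turnover(w0, w1)            # A is sold and C is bought
print(traded, net_return(w1, returns_t1, traded, cost_rate=0.002))
```

The scores only matter through the names they push into the book and the trading they cause. A model that reshuffles the top of the ranking every day can score well and still hand a fifth of its gross return to costs, as it does in this toy case.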

QuantBench’s second contribution is the data layer. It builds breadth across market data, fundamentals, relational data, and news; and depth across frequencies from quarterly financial statements to tick-level trades and quotes. The stock universes cover China, the US, Hong Kong, and the UK, with market data starting from roughly 2003–2006 and cut off at May 2024. Relational data include Wikidata-derived links and industry graphs, with attention to temporal snapshots to avoid future-information leakage.

The third contribution is empirical: once models are compared under a more consistent pipeline, the glamorous hierarchy collapses. More expressive models do not reliably dominate. More data does not automatically help. Graphs are not magic. Loss functions matter. Validation design matters. Rolling retraining matters. Ensembling helps, but partly because the underlying signals are unstable enough to need it. A benchmark, in other words, becomes useful when it embarrasses simple narratives.

IC is not P&L, and the difference is where money goes to sulk

The paper’s most useful early comparison is deliberately plain: XGBoost versus LSTM on Chinese stock data, using two feature sets, Alpha101 and Alpha158. This is main evidence, not a decorative side test. It asks whether a deep neural network’s stronger fitting ability translates into better investment outcomes when evaluated through a ranking-based stock-selection backtest.

The answer is: not reliably.

Feature set	Model	IC	Return	Sharpe ratio	What the result says
Alpha101	XGBoost	2.31%	24.58%	0.8093	Lower IC, slightly better portfolio outcome
Alpha101	LSTM	4.76%	24.25%	0.7741	Higher IC, slightly weaker return and Sharpe
Alpha158	XGBoost	2.53%	20.31%	0.6407	Lower IC and weaker portfolio result
Alpha158	LSTM	5.95%	23.76%	0.7561	Higher IC and better portfolio result

This table is awkward in exactly the right way. With Alpha158, the LSTM wins on both prediction and portfolio metrics. With Alpha101, the LSTM still wins on IC but loses slightly on return and Sharpe. The paper interprets this as evidence that tree models may perform better when feature sets already contain strong predictive structure, possibly because trees handle tabular engineered features with less overfitting. DNNs can capture more complex relationships, but complexity is not free. It can spend signal budget on patterns that do not survive the conversion from score to portfolio.

For operators, the lesson is not “XGBoost forever”. That would be a very 2018 kind of comfort blanket. The lesson is that IC should be treated as a diagnostic, not a contract. IC measures correlation between the model’s signal and future returns. Portfolio return and Sharpe depend on position construction, concentration, drawdown behaviour, turnover, and the distribution of forecast errors where capital is actually allocated.
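For readers who want the diagnostic pinned down, here is a minimal sketch of a cross-sectional IC. It assumes IC is computed per date as a correlation between signal and next-period return and then averaged; whether the paper uses Pearson or rank correlation is not stated in this article, so both are offered. The helper names and the toy data are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """Rank from 0 (smallest) upward; ties are not averaged in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = rank
    return out


def mean_ic(signals_by_date, returns_by_date, use_ranks=False):
    """Average the per-date cross-sectional correlation."""
    ics = []
    for sig, ret in zip(signals_by_date, returns_by_date):
        if use_ranks:
            sig, ret = ranks(sig), ranks(ret)
        ics.append(pearson(sig, ret))
    return mean(ics)


# Two hypothetical dates, four assets each.
signals = [[1, 2, 3, 4], [1, 2, 3, 4]]
future_returns = [[0.01, 0.02, 0.03, 0.04], [0.02, 0.01, 0.04, 0.03]]
print(mean_ic(signals, future_returns, use_ranks=True))  # roughly 0.8
```

Note what the function never sees: position sizes, costs, or which names were actually bought. That omission is the whole argument of this section.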

A model can improve average ranking quality while worsening the tails that matter to a portfolio. It can also improve prediction on many names that receive little capital while failing on the names selected into the book. It can raise IC by becoming better at weak signals across the cross-section, yet fail to improve the top tranche enough to beat a simpler model. This is where investment AI often loses money politely: not through obvious failure, but through metric misalignment.

The operational replacement is straightforward. For every candidate model, evaluate at least three layers: signal quality, portfolio behaviour, and execution sensitivity. IC answers “is there predictive association?” Return asks “does the strategy monetise it?” Sharpe asks “does it monetise it with tolerable variance?” Turnover and drawdown then ask whether the firm can live with the answer.
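The portfolio-behaviour layer can be illustrated with two of the usual suspects. This is a hedged sketch: it assumes daily returns, 252 trading periods per year, and a zero risk-free rate, none of which is specified in the article, and the two return streams are invented for the example.

```python
from math import sqrt
from statistics import mean, stdev


def sharpe_ratio(period_returns, periods_per_year=252):
    """Annualised mean over standard deviation, risk-free rate assumed zero."""
    return mean(period_returns) / stdev(period_returns) * sqrt(periods_per_year)


def max_drawdown(period_returns):
    """Largest peak-to-trough loss of the compounded equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in period_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


# Two hypothetical strategies with the same average return per period.
steady = [0.01, 0.00, 0.01, 0.00]
wild = [0.10, -0.09, 0.10, -0.09]

for name, stream in (("steady", steady), ("wild", wild)):
    print(name, round(sharpe_ratio(stream), 2), round(max_drawdown(stream), 4))
```

Both streams average the same return, so a return-only report would call them equals. Sharpe and drawdown are where the second one confesses.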

Newer architectures do not get a participation trophy

The broad model comparison is the paper’s architectural stress test. It uses US stock data with volume-price and fundamental features, trains models using IC loss, incorporates Wikidata relational information where relevant, and evaluates models under a shared backtesting setup.

The results are not kind to architectural fashion.

Vanilla RNNs perform well. GRU reports an IC of 4.12%, return of 61.36%, and Sharpe of 3.4433. ALSTM reports the strongest IC among the adapted RNN group at 4.22%, though its return and Sharpe remain close to the RNN family. TCN performs respectably, with IC of 3.86%, return of 54.72%, and Sharpe of 2.9729. TFT also performs strongly, with IC of 3.99%, return of 54.31%, and Sharpe of 3.2324.

Then come the Transformers. Autoformer reports IC of 0.08%, return of -8.82%, and Sharpe of -0.7467. FEDformer is similarly poor. Pyraformer and PatchTST are positive but weak relative to the better recurrent, tabular, and adaptive graph models. The point is not that Transformers are useless in finance. The point is narrower and more useful: Transformer architectures designed for time-series modelling do not automatically map well into stock prediction under this benchmark.

The graph story is also mixed. GCN performs badly, with negative IC and return. GAT performs much better, reporting IC of 3.90%, return of 58.07%, and Sharpe of 3.0730. RGCN performs reasonably, with IC of 3.78%, return of 49.58%, and Sharpe of 2.7706. Hypergraph models such as STHAN and STHGCN perform poorly in this setup. Adaptive graph models are the interesting exception: THGNN reports IC of 4.93%, return of 65.04%, and Sharpe of 3.3184.

The comparison matters because it separates three claims that are too often fused together:

Claim	What QuantBench suggests	Business reading
“Temporal modelling helps.”	Often yes; RNN-family models and TCN perform competitively.	Keep sequence models in the research stack. They are not obsolete just because they are less fashionable.
“Transformers dominate sequence tasks.”	Not in this reported stock-prediction setting.	Do not import NLP architecture preferences as investment policy. Markets are not text with tickers.
“Relational structure helps.”	Sometimes; static or poorly matched graphs can fail, adaptive graphs do better.	Relationship data needs modelling discipline, not graph-shaped optimism.

This is a comparison-based result, not a universal ranking of every model family. The benchmark setup matters: data source, stock universe, feature design, training objective, and backtest construction all shape the outcome. Still, the business implication is durable. Architecture selection should be treated as portfolio engineering, not brand selection. The question is not “which model class is advanced?” The question is “which assumptions about time, cross-sectional relationships, noise, and regime shift are being encoded, and are those assumptions useful for this book?”

Static relationships are cheap; useful relationships are learned

The paper’s relational-data findings deserve special attention because they push against an increasingly common fintech reflex: build a knowledge graph, attach a graph neural network, call it intelligence, then wait for investors to applaud.

QuantBench incorporates relational information from sources such as Wikidata and industry classifications. It also takes temporal information into account for relations, which matters because using a relationship that did not exist at the time of prediction would be leakage wearing a blazer.
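The leakage point is easy to make concrete. The sketch below assumes each relation carries a validity interval and filters the graph as of the prediction date; the `Edge` fields, the tickers, and the dates are hypothetical and do not reflect QuantBench's actual schema.

```python
from datetime import date
from typing import NamedTuple, Optional


class Edge(NamedTuple):
    source: str
    target: str
    relation: str
    valid_from: date
    valid_to: Optional[date]  # None means the relation is still in force


def edges_as_of(edges, as_of):
    """Keep only relations that already existed, and had not ended, on `as_of`."""
    return [
        e for e in edges
        if e.valid_from <= as_of and (e.valid_to is None or as_of < e.valid_to)
    ]


# Hypothetical relations between made-up tickers.
edges = [
    Edge("AAA", "BBB", "supplier", date(2019, 1, 1), None),
    Edge("AAA", "CCC", "subsidiary", date(2022, 6, 1), None),
    Edge("BBB", "DDD", "supplier", date(2015, 1, 1), date(2020, 1, 1)),
]

snapshot = edges_as_of(edges, date(2021, 1, 1))
print([(e.source, e.target, e.relation) for e in snapshot])
```

A model predicting on 1 January 2021 should see neither the subsidiary link created in 2022 nor the supplier link that ended in 2020. Feeding it today's graph for every historical date is the blazer in question.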

The experiments show that simply adding graph structure does not guarantee performance improvement. Wikidata inclusion yields minimal improvements in the main model comparison, possibly because such public relationships are already known and priced. Homogeneous graph models such as GCN underperform. Relational GNNs, which distinguish edge types, do better than homogeneous graph models. Adaptive graph models, which learn latent structures from data, perform better still.

This creates a useful hierarchy for business users:

Known static relationships can be operationally clean but may be economically stale.
Typed relationships improve representation because supplier links, ownership links, and industry links should not be treated as the same kind of edge.
Adaptive relationships may capture latent co-movement and changing market structure, but they also increase model-risk burden because the learned graph must be monitored, interpreted, and stress-tested.

The paper does not prove that adaptive graph models are always superior. It shows they perform strongly in the reported benchmark. The broader lesson is more interesting: relational modelling is valuable only when the representation matches the economic mechanism. A graph is not an edge list with ambition. It is a claim about how information, risk, or behaviour travels across assets.

For a quant desk, this means graph-based research should begin with a relationship thesis. Supply chains may matter for shock propagation. Ownership may matter for governance and capital flows. Industry membership may matter for factor exposure. News co-mentions may matter for attention spillovers. Latent learned edges may matter for co-movement. Each thesis needs different validation. If all of them are pushed through the same graph pipeline, the result is not sophistication. It is data plumbing with a conference badge.

Loss functions are investment policy in disguise

The training-objective experiment is an ablation-style test. It changes the learning objective while holding the broader setup comparable, using LSTM, RGCN, and DTML with classification loss, IC loss, MSE loss, and pairwise ranking loss. Its purpose is not to crown one universal loss function. It shows how strongly objective choice shapes the eventual investment metric.

The IC loss performs strongly on return and IC. For LSTM, IC loss reports 36.03% return and IC of 3.62%, while classification loss produces a lower return of 19.31% but a higher Sharpe of 1.9757 versus 1.8642. For RGCN, IC loss reports the highest return at 44.97% and IC of 3.97%. For DTML, IC loss again reports the highest return at 38.56% and IC of 3.24%. MSE and ranking losses underperform badly in several cases, with ranking even producing negative returns for all three reported models.

This is not just a technical detail. A loss function is a compressed statement of what the organisation rewards during training. MSE asks the model to approximate return values. Classification asks it to get direction right. Ranking asks it to order assets pairwise. IC loss aligns more directly with the cross-sectional signal metric used in quant selection. Utility maximisation, not the focus of this specific table, would move still closer to portfolio objectives.
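To see how differently these objectives judge the same forecast, consider the sketch below. The exact loss definitions used in QuantBench are not given in this article, so these are common textbook forms and should be read as assumptions: MSE on return values, IC loss as negative Pearson correlation, and a logistic pairwise ranking loss. Classification loss is left out for brevity, and the data are invented.

```python
from math import exp, log
from statistics import mean


def mse_loss(pred, target):
    """Mean squared error between predicted and realised returns."""
    return mean([(p - t) ** 2 for p, t in zip(pred, target)])


def ic_loss(pred, target):
    """Negative Pearson correlation, so minimising it maximises IC."""
    mp, mt = mean(pred), mean(target)
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_t = sum((t - mt) ** 2 for t in target)
    return -cov / (var_p * var_t) ** 0.5


def pairwise_rank_loss(pred, target):
    """Logistic loss over every ordered pair whose realised returns differ."""
    losses = []
    for i in range(len(pred)):
        for j in range(len(pred)):
            if target[i] > target[j]:
                losses.append(log(1.0 + exp(-(pred[i] - pred[j]))))
    return mean(losses)


# Hypothetical realised returns and a forecast that is right up to scale.
target = [0.03, -0.01, 0.02, -0.02]
scaled = [10 * t for t in target]

print(mse_loss(scaled, target))            # penalised for the wrong scale
print(ic_loss(scaled, target))             # about -1: ordering is perfect
print(pairwise_rank_loss(scaled, target))  # below log(2): pairs ordered correctly
```

The forecast is ten times too large, which MSE punishes and IC loss ignores entirely. For a strategy that only ranks stocks, the second verdict is arguably the relevant one; for a strategy that sizes positions from forecast magnitudes, it may not be.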

When teams choose loss functions casually, they are not making a harmless engineering choice. They are choosing which mistakes the model is allowed to care about. In a strategy that buys only the top-ranked stocks, getting the middle of the cross-section slightly better may not matter much. Misranking the names that enter the portfolio matters. In a risk-constrained strategy, return prediction without volatility awareness can be less useful than a weaker signal with better drawdown behaviour.

The paper’s business message is therefore blunt: model governance should include objective governance. Every research report should state not only which architecture was used, but why the loss function matches the investment decision. If that paragraph is missing, the model may still work. So may a vending machine kicked at the correct angle. Neither should be called a process.

Alpha decay turns model maintenance into a first-class cost

The alpha-decay experiment is a sensitivity and operational-maintenance test. It evaluates walk-forward updating schemes with rolling steps of 3, 6, and 12 months, plus a no-rolling baseline, using US stock data from 2021 to 2023 and the S&P 500 as benchmark.

The visual result is clear: the 3-month rolling scheme performs best, while the no-rolling approach performs worst. More frequent retraining helps address distribution shift and slows alpha decay. The catch is also clear: more frequent retraining increases compute and operational cost.
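The scheduling logic behind such a test is simple enough to sketch. The version below assumes a fixed-length training window that slides forward and a model refit every `step` months; the article does not say whether the paper uses a sliding or an expanding window, so that choice, along with the function name and the 24-month training length, is an assumption.

```python
def walk_forward_windows(n_months, train_len, step):
    """Yield (train, test) month-index ranges; the model is refit every `step` months."""
    start = train_len
    while start < n_months:
        end = min(start + step, n_months)
        yield range(start - train_len, start), range(start, end)
        start = end


# Hypothetical setup: 24 months of training history, 12 months to trade.
n_months, train_len = 36, 24

for step in (3, 6, 12):
    refits = list(walk_forward_windows(n_months, train_len, step))
    print(f"step={step:>2} months -> {len(refits)} model fits")
```

With a 12-month step over a 12-month test period the model is fitted once and never refreshed, which is the no-rolling baseline in miniature. The 3-month scheme fits four times over the same period, and each of those fits is compute, validation, and a deployment someone has to sign off.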

This is where the paper becomes directly useful to managers rather than just researchers. In production, a model is not a one-off artefact. It is an asset with maintenance capex. If alpha decays quickly, then retraining cadence is not a back-office scheduling issue. It becomes part of the investment strategy.

The trade-off can be stated simply:

Decision	Benefit	Cost	Governance question
Frequent rolling retraining	Slower decay; better adaptation to market evolution	More compute, more validation, more deployment risk	What is the marginal performance gain per retraining cycle?
Infrequent retraining	Lower infrastructure and review burden	Faster model staleness	How quickly does the signal decay after deployment?
No rolling update	Operational simplicity	Worst reported performance in this test	Is simplicity being confused with stability?

The paper points toward continual learning and online learning as future research directions. Cognaptus’ business inference is narrower: firms need an alpha-maintenance budget. Every AI strategy should specify how often the model is expected to degrade, how that decay is monitored, what triggers retraining, and who is accountable for approving the refreshed model.

Without that, “AI-driven investing” becomes a strange ritual: train once, deploy proudly, and watch the market quietly walk away.
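A retraining trigger need not be elaborate to be better than nothing. The sketch below is one hypothetical rule, not something proposed in the paper: compare the recent average live IC with the IC seen in backtest and flag a refresh when it falls below a chosen fraction. The window length and the 50% floor are arbitrary illustrative values.

```python
from statistics import mean


def needs_retraining(live_ics, backtest_ic, window=20, floor_ratio=0.5):
    """Flag a refresh when the recent average IC falls below a share of the backtest IC."""
    if len(live_ics) < window:
        return False  # not enough live evidence yet
    return mean(live_ics[-window:]) < floor_ratio * backtest_ic


# Hypothetical daily IC readings after deployment.
healthy = [0.04] * 20
decayed = [0.04] * 5 + [0.0] * 15

print(needs_retraining(healthy, backtest_ic=0.04))  # False
print(needs_retraining(decayed, backtest_ic=0.04))  # True
```

The rule itself matters less than the fact that it is written down, versioned, and owned by someone. A threshold that lives only in a researcher's head is not a monitoring policy.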

Validation design is not admin work; it is model risk

The hyperparameter-tuning experiment is a robustness and implementation test. It compares validation-set construction methods: tail, random, and fragmented. It also compares normal training against retraining on the full dataset after hyperparameters are selected.

The result complicates the usual assumption that the validation segment immediately before the test set is always best. Random validation outperforms tail validation in both normal and retrain settings. Under normal training, random validation reports IC of 3.86%, return of 36.84%, and Sharpe of 1.8969, compared with tail validation at IC of 3.29%, return of 29.87%, and Sharpe of 1.5454. Under retraining, random validation again leads, with return of 41.36% and Sharpe of 1.9334.

Fragmented validation performs poorly under normal training but improves under retraining. The authors suggest that fragmented validation may produce more stable hyperparameters, though it needs further refinement.

The operational lesson is that validation is not a checkbox. In non-stationary markets, validation design encodes assumptions about which historical conditions matter for future deployment. A tail validation set assumes the most recent past is most representative. A random validation set exposes the model to more diverse regimes. A fragmented design tries to sample across history while preserving some temporal spread. Each choice is a market hypothesis.
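The three designs differ only in which historical periods are held out, which makes them easy to compare side by side. The sketch below is one plausible reading of "tail", "random", and "fragmented"; the paper's exact construction is not described in this article, so the block sizes, the even spacing of fragments, and the function names are assumptions.

```python
import random


def tail_split(n, n_val):
    """Validation is the last n_val periods before the test set."""
    return list(range(n - n_val, n))


def random_split(n, n_val, seed=0):
    """Validation periods are drawn uniformly from the whole training history."""
    return sorted(random.Random(seed).sample(range(n), n_val))


def fragmented_split(n, n_val, n_fragments):
    """Validation is several contiguous blocks spread evenly over the history."""
    block = n_val // n_fragments
    stride = n // n_fragments
    idx = []
    for f in range(n_fragments):
        start = f * stride + (stride - block)  # block sits at the end of each stride
        idx.extend(range(start, start + block))
    return idx


# Hypothetical history of 100 periods with 20 held out for validation.
print(tail_split(100, 20))
print(random_split(100, 20))
print(fragmented_split(100, 20, n_fragments=4))
```

Printed next to each other, the index sets make the hidden hypotheses visible: one bets on the last 20 periods, one on a scatter across history, one on four evenly spaced blocks. Random sampling of individual periods also places validation points between training points, which is where the temporal-dependence concern comes from.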

A good quant review should therefore ask:

Validation choice	Hidden assumption	Failure mode
Tail validation	Recent history best represents near future	Overfits to the latest regime
Random validation	Diverse past patterns improve hyperparameter choice	Can blur temporal dependence if used carelessly
Fragmented validation	Stability comes from sampled historical fragments	May underperform unless retraining and segmentation are designed well

The paper does not settle the validation problem. It usefully makes it harder to ignore. That is enough.

Ensembles help because the signal is fragile, not because averaging is magic

The ensemble experiment is a robustness test against noisy optimisation. The authors train an MLP-Mixer model over 40 repeated runs with different random seeds on US volume-price data, average the predictions, and backtest the ensemble. Figure 5 shows substantial variance across runs, while the ensemble reduces variance and improves robustness.

The appendix extends this logic by showing low prediction correlation among different models on CSI300, suggesting potential for ensembling across model families as well as across repeated runs of the same architecture.

The mechanism is intuitive. In a low signal-to-noise environment, different random seeds can push the same model toward different noisy patterns. Averaging predictions can cancel some idiosyncratic errors. But this should not be oversold. Ensembling reduces variance; it does not manufacture economic signal from nothing. It can make a fragile process less fragile. It cannot make a bad thesis good.
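The variance argument can be checked on a toy problem. The simulation below is not the paper's MLP-Mixer experiment: it assumes each "training run" returns the true signal plus independent Gaussian noise that depends only on the seed, which is a deliberately crude stand-in for optimisation noise. The signal scale, noise scale, and asset count are arbitrary.

```python
import random
from statistics import mean


def noisy_run(true_signal, noise_scale, seed):
    """One hypothetical training run: the true signal plus seed-specific noise."""
    rng = random.Random(seed)
    return [s + rng.gauss(0.0, noise_scale) for s in true_signal]


def ensemble(runs):
    """Average predictions asset by asset across runs."""
    return [mean(column) for column in zip(*runs)]


def mse(pred, target):
    """Mean squared error against the true signal."""
    return mean([(p - t) ** 2 for p, t in zip(pred, target)])


# Low signal-to-noise setting: weak signal, loud noise, 40 seeds.
rng = random.Random(123)
true_signal = [rng.gauss(0.0, 0.1) for _ in range(200)]
runs = [noisy_run(true_signal, noise_scale=1.0, seed=s) for s in range(40)]

single_errors = [mse(run, true_signal) for run in runs]
ensemble_error = mse(ensemble(runs), true_signal)
print(min(single_errors), max(single_errors), ensemble_error)
```

Averaging 40 independent runs cuts the noise variance by roughly a factor of 40 in this setup, which is why the ensemble error sits far below every individual run. Set `true_signal` to all zeros and the ensemble converges confidently on nothing, which is the boundary the paragraph above describes.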

For investment operations, ensembling has three uses:

Use	Practical value	Boundary
Same-model repeated runs	Reduces seed sensitivity	May hide instability rather than solve it
Cross-model ensembles	Captures diverse views when predictions are weakly correlated	Requires correlation monitoring and allocation logic
Strategy-level ensembles	Diversifies across signals or books	Can increase complexity, turnover, and governance burden

The best interpretation is that ensemble performance is a diagnostic. If a single model varies wildly by seed, the team has learned something uncomfortable about the signal environment. The ensemble may be useful, but the variance itself should trigger investigation. Robustness should not be outsourced to averaging and a hopeful spreadsheet.

More data helps only after it survives integration

The appendix experiment on information sources is an exploratory extension, not the main thesis. It asks whether layering fundamentals (F), news (N), industry (I), and Wikidata (W) on top of volume-price (VP) data improves model performance for a DNN and XGBoost on US stock data.

The answer: yes, but not monotonically, and not regardless of representation.

For the DNN, raw volume-price data alone produces IC of 3.43% but negative return and Sharpe. Adding fundamentals improves return sharply to 30.89% and Sharpe to 1.5592. Adding news improves IC to 4.07%, return to 31.97%, and Sharpe to 1.6825. But adding industry as graph information reduces IC to 1.97%, return to 13.51%, and Sharpe to 0.9173. Adding Wikidata gives IC of 3.80% but return of 18.06%, below the VPFN result.

For XGBoost, VP alone is weak and negative. Adding fundamentals improves IC but return remains negative. Adding industry finally produces a positive return of 1.74% and Sharpe of 0.0833, still modest. The VPFNW setting is not reported for XGBoost.

This appendix result is useful because it cuts against another easy belief: “just add alternative data”. Data breadth helps only if the model can ingest the information in a form that preserves economic meaning. Fundamentals and news appear useful in the reported setting. Graph-structured industry information hurts the DNN result, which the authors attribute to limitations of RGCN in that context. Wikidata keeps IC at 3.80%, above the volume-price-only figure of 3.43%, yet return falls to 18.06%, well short of the VPFN result, again reminding us that prediction metrics and monetisation can diverge.

The business inference is not “buy less data”. It is “test data integration as a model component”. A new dataset should be evaluated through incremental contribution, representation choice, leakage controls, stability, turnover impact, and downstream portfolio value. Otherwise, the firm may simply be purchasing more expensive noise.
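An incremental test is mostly bookkeeping. The sketch below takes the DNN figures quoted earlier in this section and computes the change in IC and return when a source is added. It assumes VPFN is the right baseline for both the industry and the Wikidata settings, which the article implies but does not state outright; the function name and the `VPFNI` label are hypothetical.

```python
# (IC %, return %) for the DNN, as quoted in the text above.
reported = {
    "VPFN": (4.07, 31.97),
    "VPFNI": (1.97, 13.51),   # assumed label for VPFN plus industry graph
    "VPFNW": (3.80, 18.06),
}


def incremental_effect(results, baseline, candidate):
    """Change in IC and return, in percentage points, from baseline to candidate."""
    base_ic, base_ret = results[baseline]
    cand_ic, cand_ret = results[candidate]
    return round(cand_ic - base_ic, 2), round(cand_ret - base_ret, 2)


for candidate in ("VPFNI", "VPFNW"):
    d_ic, d_ret = incremental_effect(reported, "VPFN", candidate)
    print(f"{candidate}: IC {d_ic:+.2f} pts, return {d_ret:+.2f} pts")
```

Against that baseline, Wikidata costs about a quarter of a point of IC and nearly fourteen points of return. A review that looked only at the first number would have waved the dataset through.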

What each experiment supports, and what it does not

A benchmark paper can easily become a buffet of tables. Useful, but dangerous if readers start turning every row into a law of markets. The experiments in QuantBench have different roles.

Paper component	Likely purpose	What it supports	What it does not prove
XGBoost vs LSTM on Alpha101/Alpha158	Main comparative evidence	Better IC does not always mean better portfolio metrics; engineered features and model class interact	Trees always beat neural networks
Broad architecture comparison	Main evidence across model families	RNNs, tabular models, and adaptive graph models can outperform more fashionable architectures in this setup	A permanent ranking of architectures
Training-objective comparison	Ablation-style objective test	Loss choice materially affects return, Sharpe, and IC	IC loss is universally optimal
Rolling retraining schemes	Sensitivity and maintenance test	Frequent updates reduce alpha decay in the reported period	Three-month retraining is always the right cadence
Validation-set selection	Robustness and implementation test	Validation design changes model outcomes and should be governed	Random validation is always superior
Ensemble experiment	Robustness test	Averaging reduces variance from noisy model fits	Ensembling solves weak signals
Model-correlation appendix	Exploratory ensemble rationale	Low model correlation suggests diversification potential	Any low-correlation model improves portfolio quality
Information-source appendix	Exploratory data-integration extension	More data can help, but representation matters	Alternative or relational data automatically improve returns
Dataset and limitation appendices	Implementation detail and boundary setting	Coverage is broad but still incomplete; backtesting realism can improve	The benchmark captures every real-world trading constraint

This table is the sober reading. Less fun than declaring a winner, but substantially cheaper than discovering the boundary in production.

The business value is an evaluation operating system

QuantBench’s business relevance is not that a CIO can read the paper, pick THGNN, retrain every three months, add fundamentals and news, ensemble everything, and then go home early. That would be convenient. Markets dislike convenience.

The value is that QuantBench provides a structure for asking better questions:

Does the chosen training objective match the investment decision?
Does predictive improvement survive portfolio construction?
Does the model degrade under market evolution, and how fast?
Does validation design reflect plausible deployment conditions?
Does relational data add information or merely decorate the feature space?
Does more data improve returns after integration costs and noise?
Does ensembling reduce variance because models are genuinely diverse, or because individual runs are unstable?
Are evaluation metrics task-specific enough for the actual use case?

For asset managers, this supports model selection and research governance. For fintech platforms, it supports product claims that are less embarrassing under due diligence. For AI-investing vendors, it provides a way to demonstrate process quality without pretending that a backtest is a prophecy. For internal quant teams, it helps allocate R&D budget: spend less time arguing about architecture prestige, more time measuring decay, objective alignment, validation stability, and data contribution.

A useful operational framework would look like this:

Operating layer	QuantBench lesson	Business action
Data	Breadth and depth matter, but integration matters more	Run incremental data-value tests before scaling vendor spend
Model	Architecture rankings are context-dependent	Maintain a model zoo, but select by portfolio outcome
Objective	Loss functions shape investment behaviour	Require objective-to-strategy justification in model review
Validation	Sampling design changes conclusions	Treat validation design as model-risk policy
Deployment	Alpha decays under distribution shift	Define retraining cadence and monitoring triggers
Robustness	Low signal-to-noise creates seed and model variance	Use ensembles, but investigate instability rather than hiding it
Evaluation	Signal, portfolio, and execution metrics differ	Report the full chain, not just IC

This is dull in the way good infrastructure is dull. It is also where most of the money is.

The limits are real, and they matter operationally

The paper is careful about future work. QuantBench still needs broader alternative data, deeper granularity such as order flow, more model architectures, more innovative formulations, and more realistic and efficient backtesting. These are not cosmetic gaps.

Order flow matters because many strategies die between portfolio intent and trade execution. Backtesting realism matters because slippage, liquidity, market impact, commissions, borrow constraints, and capacity can change rankings across strategies. Alternative data matters because some signals are not visible in standard volume-price or fundamental fields. Newer architectures matter because the model universe is moving quickly, and today’s benchmark suite will not stay complete for long.

There is also a broader interpretation boundary. The empirical results are based on selected markets, datasets, feature sets, periods, and backtest assumptions. That makes them useful for comparative diagnosis, not for universal investment claims. A model that underperforms in QuantBench may still work in a specialised strategy with different data, costs, horizons, constraints, or execution logic. A model that performs well in QuantBench may fail live once capacity, crowding, latency, and regime change arrive with their usual lack of manners.

So the right conclusion is not that QuantBench has solved quant AI evaluation. It has made the evaluation problem harder to evade. That is progress.

Conclusion: the benchmark is the product

The most important part of QuantBench is not the winning model. It is the benchmark discipline that makes “winning” a more meaningful word.

The paper’s comparisons repeatedly undermine the easy story that quant AI advances by stacking more expressive models on wider datasets. DNNs can improve IC without improving portfolio outcomes. Transformers can underperform older sequence models. Graphs can help or harm depending on representation. More information can improve results, then damage them when integrated poorly. Frequent retraining can slow alpha decay, but it adds cost. Ensembles can reduce variance, but only because the underlying environment is noisy enough to make variance a problem in the first place.

For business users, this turns QuantBench into an operating philosophy. Evaluate the whole chain. Match objectives to decisions. Treat data integration as a model-design problem. Monitor decay. Govern validation. Use ensembles as robustness tools, not as superstition. Above all, stop confusing architecture novelty with investment edge.

Trees still matter. Deep roots still matter. But in quant AI, the forest is the benchmark.

Cognaptus: Automate the Present, Incubate the Future.

¹ Saizhuo Wang, Hao Kong, Jiadong Guo, Fengrui Hua, Yiyan Qi, Wanyun Zhou, Jiahao Zheng, Xinyu Wang, Lionel M. Ni, and Jian Guo, “QuantBench: Benchmarking AI Methods for Quantitative Investment,” arXiv:2504.18600, 2025, https://arxiv.org/abs/2504.18600.
