Trading on Memory: Why Markov Models Miss the Signal

TL;DR for operators

A trader usually asks, “What is the signal now?” This paper asks a more expensive question: “What did the signal do on the way here?” That difference matters when alpha does not decay instantly, when order flow moves prices slowly, or when volatility changes the usefulness of the same forecast.

The paper introduces a kernel-based framework for dynamic, path-dependent mean-variance trading. Instead of forcing the trading position to depend only on the current state, the strategy can depend on the whole observed trajectory of prices, forecasts, or other market-state variables. The authors represent that strategy inside a reproducing kernel Hilbert space, derive a closed-form solution for the mean-variance objective, and add a spectral decomposition method to reduce numerical instability.¹

The operational claim is not “kernels beat Markowitz.” That would be wonderfully convenient and therefore suspicious. The paper’s actual claim is narrower and more useful: Markovian methods become suboptimal when the asset dynamics or predictive signals contain temporal structure that affects future PnL and terminal risk. In those settings, path-dependent methods can extract information that a current-state model throws away.

The experiments support that narrower claim. In a synthetic path-dependent drift setting, kernel trading strongly outperforms a linear Markovian strategy when the drift has slow memory. In the MSFT intraday experiment, the authors use real 5-minute price bars but controlled synthetic predictive signals, allowing them to test signal decay and heavy-tailed residual noise without pretending that a mysteriously perfect live alpha just fell from the sky. Kernel trading and signature trading both outperform the Markovian baseline out of sample, with broadly similar performance.

The business relevance is clearest for quant teams already building signal factories. If a desk has alpha signals with non-instantaneous decay, cross-impact effects, lagged reactions, or volatility-dependent reliability, this framework offers a structured way to convert full trajectories into inventory decisions without training a deep model by gradient descent. The price is engineering complexity: Gram matrices, scaling, regularisation, eigenvalue truncation, memory pressure, and online evaluation against historical paths. Memory is useful. It is not free.

The familiar mistake is treating every signal as a snapshot

Most trading systems already contain memory. They just hide it in feature engineering.

A “current” signal may include moving averages, lagged returns, realised volatility, order-flow imbalance, cross-asset flows, or some learned representation of recent behaviour. By the time the model sees the number, history has already been compressed into it. The problem is that this compression is usually fixed before the portfolio optimiser enters the room. The optimiser receives a neat current-state feature and behaves as if the past has done its job and politely left.

That is the Markovian convenience. It says the current state is enough.

Sometimes it is. If the alpha impact is nearly instantaneous, if volatility is stable enough, if execution risk is not strongly history-dependent, and if the signal’s past shape adds little beyond its current value, then a Markovian strategy can be the right blunt instrument. Quant finance has survived on blunt instruments for a long time; some even wear suits.

But the paper’s central point is that many trading problems are not like that. A signal’s path can matter because alpha decays over time. Order flow can affect prices gradually. Volatility can change the reliability of forecasts. Two signals can have the same instantaneous predictive strength and still produce different trading outcomes because their temporal structures differ.

That is where the comparison becomes useful. This paper is best read as a three-way contest among:

Approach	What it remembers	What it buys	What it costs
Classical Markowitz / Markovian strategy	Current expected return and covariance-style risk	Simplicity, interpretability, fast decision rules	Misses temporal structure when alpha decay or drift is path-dependent
Signature trading	A truncated algebraic summary of the path	Structured path dependence with more interpretability	Truncation choice, feature explosion at higher order
Kernel trading	Similarity between whole paths through a kernel feature space	Flexible path-dependent modelling without explicit feature expansion	Gram matrices, scaling sensitivity, regularisation, memory and online compute

The paper does not ask the reader to abandon Markowitz. It asks the reader to stop using Markowitz as if every signal were a still photograph.

What the kernel trader actually does

The paper’s technical move is to represent the trading inventory as a function of the observed trajectory. That trajectory may include the asset path, predictive signals, order flow, volatility measures, or other market-state variables. The function lives in an RKHS, meaning it can be manipulated through kernel evaluations rather than explicit high-dimensional features.

In plain language: the strategy learns how similar the current market path is to previously observed paths, then uses that similarity structure to determine the trading position.
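A minimal sketch of that similarity-weighted idea follows. It is an illustration under loud assumptions: the Gaussian kernel on flattened paths is a stand-in for whatever path kernel a desk would actually use (it is not the signature kernel), and the names `rbf_path_kernel`, `kernel_position`, and the weights `alphas` are hypothetical. In a real system the weights would come from the fitting step, not from a hand-typed array.

```python
import numpy as np

def rbf_path_kernel(p, q, bandwidth=1.0):
    """Stand-in similarity between two equal-length paths (NOT the signature kernel)."""
    diff = np.asarray(p, dtype=float).ravel() - np.asarray(q, dtype=float).ravel()
    return float(np.exp(-np.dot(diff, diff) / (2.0 * bandwidth ** 2)))

def kernel_position(current_path, reference_paths, alphas, bandwidth=1.0):
    """Position = sum_i alphas[i] * k(reference_paths[i], current_path)."""
    sims = np.array([rbf_path_kernel(ref, current_path, bandwidth)
                     for ref in reference_paths])
    return float(sims @ np.asarray(alphas, dtype=float))

rng = np.random.default_rng(0)
reference_paths = rng.normal(size=(5, 10)).cumsum(axis=1)  # 5 toy paths, 10 steps each
alphas = np.array([0.5, -0.2, 0.1, 0.0, 0.3])              # hypothetical fitted weights

# The current path is compared with every stored path; closer paths get more say.
position = kernel_position(reference_paths[0], reference_paths, alphas)
```

The design point is that the current path never needs explicit features. It only needs to be comparable with the stored paths, which is also why online evaluation cost grows with the number of stored paths.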

The authors define a kernel trading strategy in which the position is produced by applying a function to a feature embedding of the path. The feature embedding is deliberately flexible. It could be based on signatures, randomised signatures, dynamic time warping, alignment kernels, convolutional kernels, sequential kernels, or even representations from neural architectures. The framework is not married to one representation, which is both its strength and its future source of delightful implementation headaches.

The key mathematical object is a PnL feature map. Rather than merely measuring similarity between paths, the framework constructs a feature representation tied to the PnL generated by trading along those paths. That allows the mean and variance of terminal PnL to be expressed through inner products and Gram matrices.
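To make the PnL feature idea concrete, here is a finite-dimensional sketch. It assumes the position is linear in a small set of explicit path features, which is a simplification of the RKHS setting; the three features, the toy price path, and the name `pnl_feature` are all hypothetical. The point it illustrates is only that terminal PnL becomes an inner product between the strategy weights and a single path-level vector.

```python
import numpy as np

def pnl_feature(features, prices):
    """Return sum_t features[t] * (prices[t+1] - prices[t]).

    features: (T, D) array; row t may only use information available at time t.
    prices:   (T+1,) array of observed prices.
    """
    return np.asarray(features, dtype=float).T @ np.diff(prices)

rng = np.random.default_rng(1)
prices = 100.0 + rng.normal(size=21).cumsum()            # toy price path, 20 trading steps
past_returns = np.diff(prices, prepend=prices[0])[:-1]   # return already known at each step
features = np.column_stack([
    np.ones(20),                                         # constant
    past_returns,                                        # last observed return
    np.cumsum(past_returns) / np.arange(1, 21),          # running mean of past returns
])
weights = np.array([0.1, -2.0, 5.0])                     # hypothetical strategy weights

positions = features @ weights                           # position held over each step
pnl_direct = positions @ np.diff(prices)                 # trade the path step by step
pnl_via_feature = weights @ pnl_feature(features, prices)  # same number, one inner product
```

Because `pnl_direct` and `pnl_via_feature` agree, the mean and variance of terminal PnL across many paths reduce to the mean and covariance of the per-path vectors returned by `pnl_feature`.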

The mean-variance objective is the familiar shape:

$$ \mathbb{E}[V_T] - \frac{\lambda}{2}\operatorname{Var}(V_T), $$

where $V_T$ is terminal PnL and $\lambda$ is the risk-aversion parameter. The novelty is not the objective. The novelty is the class of strategies over which the objective is optimised.

The paper then derives closed-form optimal weights for the kernel trading strategy. The solution keeps the same broad financial intuition as classical mean-variance optimisation: expected return interacts with a covariance-like object. But now the covariance lives in the PnL feature space, where whole path histories can matter.
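The shape of such a solution can be sketched in explicit finite dimensions. This is a generic ridge-regularised mean-variance solve, not the paper's RKHS formula: it assumes each sample path has already been reduced to a PnL feature vector, and the function name, the `ridge` term, and the toy data are assumptions for illustration.

```python
import numpy as np

def mean_variance_weights(pnl_features, risk_aversion, ridge=1e-8):
    """Maximise  w.m - (risk_aversion/2) w'Cw - (ridge/2)|w|^2  over w.

    pnl_features: (N, D) array, one PnL feature vector per sample path.
    m and C are the sample mean and population covariance of those rows.
    Setting the gradient to zero gives (risk_aversion*C + ridge*I) w = m.
    """
    phi = np.asarray(pnl_features, dtype=float)
    mean = phi.mean(axis=0)
    centred = phi - mean
    cov = centred.T @ centred / phi.shape[0]
    system = risk_aversion * cov + ridge * np.eye(phi.shape[1])
    return np.linalg.solve(system, mean), mean, cov

rng = np.random.default_rng(7)
phi = rng.normal(loc=0.05, scale=1.0, size=(2_000, 3))   # toy PnL features, not market data
weights, mean, cov = mean_variance_weights(phi, risk_aversion=2.0)

terminal_pnl = phi @ weights                             # one terminal PnL per sample path
objective = terminal_pnl.mean() - 0.5 * 2.0 * terminal_pnl.var()
```

The fit is a single linear solve. Everything difficult has been pushed into building the features and keeping the linear system well-conditioned.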

This is why the paper is not a “deep learning for trading” article. It is almost the opposite. It offers a flexible nonlinear path-dependent model while preserving an analytic fitting step. No backpropagation ritual. No GPU incense. Still plenty of matrix pain, naturally.

Why Markowitz loses when the drift has memory

The strongest conceptual section of the paper is its comparison between terminal mean-variance optimisation and an instantaneous variance constraint.

A local Markowitz-style strategy focuses on the current expected return and a static covariance penalty. That is appropriate when the relevant risk is instantaneous. But the paper’s objective penalises terminal PnL variance. Once the drift is stochastic and path-dependent, terminal risk depends not only on today’s position but on how future drift and future positions co-move.

That creates a hedging-demand term. The strategy should care about future changes in expected return, not only the current value of the signal. This is the theoretical reason a Markovian strategy can become suboptimal even when it appears to have access to the right current signal.

The authors make this concrete with a stochastic drift model:

$$ dX_t = \mu_t dt + dM_t. $$

If $\mu_t$ carries temporal structure, the optimal terminal mean-variance strategy needs to anticipate the future decay and covariance structure of that drift. A current-state Markowitz rule cannot fully do that. It sees the signal; it does not see the signal’s biography.
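One way to see where the hedging demand comes from is a discrete-time version of the terminal variance. The notation is an assumption for illustration, not the paper's: $\xi_t$ is the position held over step $t$ and $\Delta X_t = X_{t+1} - X_t$ is the price increment.

$$ V_T = \sum_{t=0}^{T-1} \xi_t \,\Delta X_t, \qquad \operatorname{Var}(V_T) = \sum_{t=0}^{T-1} \operatorname{Var}(\xi_t \Delta X_t) + 2 \sum_{0 \le s < t \le T-1} \operatorname{Cov}(\xi_s \Delta X_s,\ \xi_t \Delta X_t). $$

A per-period risk penalty only touches the first sum. When $\mu_t$ is persistent, the per-period PnL terms are linked through the drift, so the cross-covariance sum is generally non-zero, and a terminal objective has to manage it.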

This distinction matters for business use because many real signals are not single-period facts. A forecast may represent information that diffuses through the market. Order-flow effects may decay slowly. Related-asset information may influence price discovery with a lag. The more the value of the signal depends on its path, the more dangerous it is to optimise as if the signal were memoryless.

The synthetic drift test is main evidence, not decorative simulation

The first numerical experiment is a controlled synthetic setting, and it carries the paper’s main evidence: it tests whether the kernel trader can exploit path-dependent drift when a Markovian baseline cannot.

The authors construct a one-dimensional asset where the trader observes a Markovian raw signal, but the drift is generated by applying a power-law memory kernel to that signal. A smaller decay parameter means slower decay and longer memory. The trading problem is therefore deliberately designed so that the history of the signal matters.
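A toy version of that construction might look like the sketch below. The specifics are assumptions, not the paper's parameterisation: the raw signal is taken to be an AR(1) process, and the memory kernel is taken to be `(lag + 1) ** (-decay)`. The names `ar1_signal` and `power_law_drift` are hypothetical.

```python
import numpy as np

def ar1_signal(n_steps, phi=0.9, seed=0):
    """Markovian raw signal: s[t] = phi * s[t-1] + standard normal noise."""
    rng = np.random.default_rng(seed)
    s = np.zeros(n_steps)
    for t in range(1, n_steps):
        s[t] = phi * s[t - 1] + rng.normal()
    return s

def power_law_drift(signal, decay):
    """drift[t] = sum_{k=0}^{t} (k + 1)^(-decay) * signal[t - k]  (causal filter)."""
    n = len(signal)
    weights = (np.arange(n) + 1.0) ** (-decay)
    return np.convolve(signal, weights)[:n]   # first n terms only use past and present

signal = ar1_signal(500)
slow_memory_drift = power_law_drift(signal, decay=0.2)   # old signal values linger
fast_memory_drift = power_law_drift(signal, decay=2.0)   # old signal values fade quickly
```

The trader in this toy setup would observe `signal` but be paid according to the drift, so the current signal value alone is an incomplete summary whenever `decay` is small.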

The experiment asks two questions. First, can the method learn the drift when it only sees the raw signal rather than the true drift? Second, can the method use temporal autocorrelation in the signal or drift to reduce terminal PnL variance?

The answer is mostly yes, with useful nuance. When the decay is slow, the kernel trader produces large outperformance relative to the Markovian signal strategy. In the figure, outperformance is especially high at small decay values and falls as the decay becomes faster. That is exactly the shape one should expect if the advantage comes from memory rather than generic model complexity.

The paper also compares a Markowitz-style strategy that has access to the true drift. This is a useful sanity check. The kernel trader that only sees the raw signal comes close to the kernel trader with access to the true drift, but does not fully match it. That matters. The method is learning a lot of the hidden path-dependent structure, not performing financial clairvoyance. Annoying, perhaps, but refreshingly honest.

A second panel varies the signal-to-noise relationship. As the relationship between drift and returns becomes cleaner, performance improves. Again, this is unsurprising in the right way. Better information helps, and memory helps more when the information has a temporal pattern worth remembering.

Test	Likely purpose	What it supports	What it does not prove
Synthetic path-dependent drift	Main evidence	Kernel trading can exploit slow signal decay and drift memory better than a Markovian signal rule	That the same edge exists in arbitrary live markets
Access to raw signal versus true drift	Main evidence / diagnostic	The method can learn much of the drift structure from observable information	That hidden drift is fully recoverable
Varying signal-to-noise ratio	Sensitivity test	The advantage grows when the underlying information is cleaner	That the method is robust to all weak-signal regimes

The business reading is simple: do not use this experiment as a performance forecast. Use it as a diagnostic template. If a desk believes its signal has delayed impact, slow decay, or path-dependent reliability, this is the kind of controlled experiment it should run before giving the strategy capital and a dramatic internal name.

The MSFT experiment is useful because it is only half real

The market-data experiment uses 5-minute Microsoft bars during regular trading hours, giving 78 trading times per day. The authors train on 2020–2023 and test on January 2024 through March 2025, producing roughly 78,000 in-sample trades and 20,000 out-of-sample trades.

That sounds like a real trading experiment. It is partly that, but not fully. The authors use real price data as the market substrate and synthetic predictive signals as controlled inputs. This design is not a weakness; it is the point.

A fully real alpha signal would raise the usual question: did the model work, or was the signal just unusually lucky? By constructing synthetic signals with controlled $R^2$, forecast horizon, decay speed, and heavy-tailed residual noise, the authors can isolate how temporal signal structure affects strategy performance. It is cleaner science, even if less exciting for anyone hoping the paper had discovered a tradable MSFT money printer. It did not. Please tell the interns.

The experiment uses a 15-minute forecast horizon and an $R^2$ of 0.5%, chosen to demonstrate low signal-to-noise conditions. The residual noise model can incorporate heteroskedasticity and heavy tails through a stochastic-volatility parameter. The signal can also decay according to a power-law kernel.
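The $R^2$ calibration step on its own can be sketched as follows. This is an assumption-laden simplification: it uses plain Gaussian noise, leaves out the heavy tails, stochastic volatility, and power-law decay described above, and replaces real MSFT returns with a placeholder array. The function name `synthetic_signal` is hypothetical.

```python
import numpy as np

def synthetic_signal(future_returns, target_r2, rng):
    """signal = future return + independent noise, with the noise variance chosen
    so that the squared correlation with the future return is about target_r2:
        Var(noise) = Var(return) * (1 - target_r2) / target_r2
    """
    noise_std = np.sqrt(np.var(future_returns) * (1.0 - target_r2) / target_r2)
    return future_returns + rng.normal(scale=noise_std, size=len(future_returns))

rng = np.random.default_rng(42)
future_returns = rng.normal(scale=1e-3, size=400_000)   # placeholder for realised returns
signal = synthetic_signal(future_returns, target_r2=0.005, rng=rng)

realised_r2 = np.corrcoef(signal, future_returns)[0, 1] ** 2
```

At an $R^2$ of 0.5% the noise standard deviation is roughly fourteen times that of the return, which is a useful reminder of how little a "predictive" signal has to predict.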

The result: kernel trading and signature trading consistently outperform the Markovian strategy that is linear in the predictive signal. The advantage is present across faster and slower decay settings, with a marginal advantage when decay is slower. As the synthetic signal becomes more heavy-tailed through the stochastic-volatility parameter, path-dependent methods still outperform, but uncertainty widens.

This is the important interpretation: two signals can have the same headline predictive strength and still produce different strategy outcomes because their temporal structures differ. A signal’s $R^2$ is not its business model. It is one line in the pre-mortem.

The paper runs 30 random seeds to generate synthetic predictions, reporting distributions rather than a single convenient path. This is a robustness-oriented design choice. It does not solve all market-data problems, but it reduces the chance that the comparison is merely one lucky draw from a noisy synthetic signal generator.

Signature trading is the nearest rival, not a straw man

The comparison with signature trading is the most practical part of the article for method selection. Signature trading and kernel trading are not enemies. They are cousins who disagree about how much of the path representation should be explicit.

Signature trading represents the strategy as a linear functional on a truncated signature of the path. The signature captures ordered interactions along the trajectory, but must be truncated at some finite order. Higher truncation adds complexity and potentially captures richer path effects, but also increases computational burden and overfitting risk.
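To show what truncation means in practice, the sketch below computes the first two signature levels of a piecewise-linear path with Chen's identity and counts how many features a truncation order implies. It is a small illustration written in plain NumPy, not the implementation used in the paper, and the function names are hypothetical.

```python
import numpy as np

def signature_level_1_2(path):
    """First two signature levels of a piecewise-linear path of shape (T, d)."""
    path = np.asarray(path, dtype=float)
    d = path.shape[1]
    s1 = np.zeros(d)
    s2 = np.zeros((d, d))
    for dx in np.diff(path, axis=0):
        # Chen's identity for appending one linear segment with increment dx
        s2 += np.outer(s1, dx) + 0.5 * np.outer(dx, dx)
        s1 += dx
    return s1, s2

def n_signature_features(d, order):
    """Number of signature terms from level 1 up to the given truncation order."""
    return sum(d ** k for k in range(1, order + 1))

rng = np.random.default_rng(2)
path = rng.normal(size=(50, 2)).cumsum(axis=0)    # toy 2-channel path
s1, s2 = signature_level_1_2(path)

features_order_4_two_channels = n_signature_features(2, 4)    # 2 + 4 + 8 + 16
features_order_4_five_channels = n_signature_features(5, 4)   # 5 + 25 + 125 + 625
```

Level 1 is just the total increment. Level 2 is where ordering enters: its antisymmetric part records which channel moved first. The feature count grows geometrically in the number of channels, which is the "feature explosion" the comparison table refers to.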

Kernel trading, when using the signature kernel, avoids explicit truncation by computing inner products between signatures through the kernel trick. In principle, this gives access to richer feature spaces. In practice, it replaces explicit feature explosion with Gram-matrix computation and kernel-evaluation costs. Finance never removes a constraint; it usually relocates it to a less obvious spreadsheet.

The paper’s performance comparison is measured rather than triumphalist. In sample-size convergence experiments using an OU-style setting, kernel and signature methods show broadly comparable performance once regularised. The kernel method converges slightly faster and has lower out-of-sample variance for smaller sample sets, suggesting greater robustness in that particular setup. But both methods are capable of learning simple Markovian dynamics directly from the underlying process.

A separate path-length experiment shows that the out-of-sample advantage over Markowitz grows as path length increases. That supports the mechanism: the more path there is to learn from, the more a path-dependent method can matter.

Another experiment increases signature truncation order and shows the signature method converging toward the kernel method. This is best read as a comparison with prior work, not as a general dominance claim. The kernel method behaves like an upper benchmark for the truncated signature method in that setting because the signature trader approximates the richer signature-kernel representation as truncation increases.

The practical result is a decision rule:

If the problem looks like this	Prefer
Single asset, low-dimensional inputs, moderate path effects, need interpretability	Signature trading
High-frequency online latency matters more than feature richness	Signature trading
Many signals/assets, nonlinear path interactions, richer feature maps	Kernel trading
Research platform exploring multiple kernels and signal embeddings	Kernel trading
Small dataset with unstable kernel Gram matrices	Neither blindly; regularisation and validation decide

The correct conclusion is not “kernel trading replaces signature trading.” The correct conclusion is “kernel trading gives the research team a broader function class, and then sends the engineering bill.”

The spectral fix is robustness, not a new alpha source

Kernel methods often require inverting matrices that are nearly singular. In the paper, the relevant object is the Gram matrix associated with the PnL feature map. When regularisation is too small, direct inversion can become unstable. A mathematically elegant solution that explodes numerically is still a bad trading system. Elegant bankruptcy remains bankruptcy.

The authors address this with a spectral representation of the optimal weights. Instead of using all eigenvectors, the solution can be truncated to the dominant components. This is effectively a low-rank stabilisation method: keep the signal-bearing directions, discard the directions that mainly amplify numerical noise.
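A generic version of that low-rank stabilisation can be sketched with a symmetric eigendecomposition. Whether the paper's spectral representation takes exactly this form is an assumption; the function name, the optional `ridge` term, and the toy rank-deficient Gram matrix are all illustrative.

```python
import numpy as np

def truncated_spectral_solve(gram, rhs, n_components, ridge=0.0):
    """Approximate the solution of (gram + ridge*I) x = rhs using only the
    n_components largest eigenpairs of the symmetric matrix gram."""
    eigvals, eigvecs = np.linalg.eigh(gram)            # eigenvalues in ascending order
    keep = np.argsort(eigvals)[::-1][:n_components]    # indices of the largest ones
    lam, U = eigvals[keep], eigvecs[:, keep]
    return U @ ((U.T @ rhs) / (lam + ridge))

rng = np.random.default_rng(3)
A = rng.normal(size=(60, 10))
gram = A @ A.T                  # rank-10 PSD matrix: 50 eigenvalues are numerically ~0
rhs = rng.normal(size=60)

# Direct inversion of `gram` is meaningless here; the top-10 solve stays finite.
stable = truncated_spectral_solve(gram, rhs, n_components=10)
```

Dividing by near-zero eigenvalues is what makes the untruncated solution blow up. Dropping those directions removes the blow-up, and, if too many are dropped, the signal as well.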

This section is a robustness and implementation contribution. It should not be interpreted as a separate trading insight. Its role is to make the kernel framework usable when the Gram matrix is ill-conditioned.

The paper’s Figure 14 illustrates the trade-off. Using the full solution with all eigenvectors can become unstable for small regularisation values. Reducing the number of eigenvectors smooths the objective curve. But removing too many eigenvectors destroys performance because it removes signal as well as noise. In the illustrated setting, the authors observe a useful region around 50 to 500 eigenvectors, while a very small number such as 5 removes too much information.

That is the practical lesson: spectral truncation is not a magic denoiser. It is another hyperparameter. It must be tuned, validated, and monitored. The model has not escaped model selection. It has acquired a more mathematical accent.

The implementation burden is where the business case becomes honest

The paper’s implementation section is unusually important because the framework’s value depends on whether a team can run it reliably.

There are three main knobs.

First, explicit regularisation. Too little regularisation overfits; too much underfits. The paper notes that this is especially sensitive for smaller datasets. The good news is that once the Gram matrix has been computed, tuning the regularisation parameter is relatively cheap compared with recomputing the kernel features.

Second, path scaling. In signature-kernel settings, scaling the input path changes the relative influence of higher-order terms. Small scaling can suppress nonlinear high-order structure. Large scaling can overemphasise it, overfit, or cause kernel convergence problems. Scaling therefore acts like implicit regularisation. This is easy to miss and expensive to debug.

Third, spectral rank. The number of eigenvectors retained controls the stability-expressivity trade-off. Too many and the solution can become unstable; too few and useful signal disappears.
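The second knob has a simple algebraic core. For a path $X$ multiplied by a constant $c > 0$, the level-$k$ signature terms scale as $c^k$ (the symbol $S^{(k)}$ for the level-$k$ term is notation assumed here):

$$ S^{(k)}(cX) = c^{k}\, S^{(k)}(X), \qquad k = 1, 2, \dots $$

With $c = 0.5$, level-4 terms are multiplied by $0.0625$; with $c = 2$, they are multiplied by $16$. A rescaling that looks like harmless preprocessing therefore reweights the whole hierarchy of path features, which is why it behaves like a regulariser.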

The computational picture is also mixed. Kernel trading requires Gram matrices of path similarities. With $N$ sample paths, this creates $N \times N$ kernel evaluations. The paper’s signature-kernel implementation involves PDE-based kernel computation, making path length and sample size important runtime drivers. Online deployment also requires evaluating the current path against co-location trajectories, which can be a latency problem.
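A back-of-envelope calculator makes the $N \times N$ cost tangible. The figures are assumptions: dense storage, 8 bytes per entry (float64), and a symmetric kernel so that only the upper triangle needs to be evaluated. The function name is hypothetical, and the cost of each individual kernel evaluation is not modelled.

```python
def gram_matrix_cost(n_paths, bytes_per_entry=8, symmetric=True):
    """Return (number of kernel evaluations, dense storage in decimal gigabytes)."""
    if symmetric:
        evaluations = n_paths * (n_paths + 1) // 2   # upper triangle incl. diagonal
    else:
        evaluations = n_paths * n_paths
    memory_gb = n_paths * n_paths * bytes_per_entry / 1e9
    return evaluations, memory_gb

evaluations, memory_gb = gram_matrix_cost(10_000)
# 10,000 sample paths: 50,005,000 distinct kernel evaluations and 0.8 GB of float64
```

Doubling the number of sample paths roughly quadruples both numbers, before any per-evaluation cost from the kernel itself is counted.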

The runtime comparison with signature trading shows no universal winner. For lower truncation orders and longer paths or more samples, signature trading can be faster. At higher truncation orders, the signature kernel becomes comparatively attractive. When input dimensionality increases, the inflection point shifts; in the paper’s benchmark with signature truncation order 4, the boundary appears around five input channels. These results were runtime-only and recorded on two NVIDIA GeForce RTX 2080 Ti GPUs; memory usage was not fully benchmarked.

For operators, the relevant question is not “Is the method closed-form?” It is: “Which part of the stack becomes the bottleneck?”

Operational component	Kernel trading implication
Research fitting	Gram-matrix computation dominates; batching and parallelisation matter
Hyperparameter search	Regularisation, path scaling, and spectral rank need validation
Online inference	Current paths must be compared with historical co-location paths
Memory	PDE-based signature kernels and Gram matrices can constrain batch size
Monitoring	Stability must be checked across regularisation and eigenvalue truncation
Scaling strategy	Nyström-style path subsampling can reduce kernel evaluations, but landmark selection becomes another modelling choice
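The Nyström idea in the last row can be sketched as follows. This is the textbook low-rank approximation with randomly chosen landmarks, not a procedure taken from the paper, and the Gaussian kernel on flattened toy paths is again a stand-in for a real path kernel. The names `rbf_gram` and `nystrom_gram` are hypothetical.

```python
import numpy as np

def rbf_gram(A, B, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))

def nystrom_gram(paths, landmark_idx, bandwidth=1.0):
    """Approximate the full Gram matrix as K_nm @ pinv(K_mm) @ K_nm.T,
    where m indexes a subset of landmark paths."""
    landmarks = paths[landmark_idx]
    K_nm = rbf_gram(paths, landmarks, bandwidth)       # N x m kernel evaluations
    K_mm = rbf_gram(landmarks, landmarks, bandwidth)   # m x m kernel evaluations
    return K_nm @ np.linalg.pinv(K_mm) @ K_nm.T

rng = np.random.default_rng(5)
paths = rng.normal(size=(200, 16))                     # 200 toy paths, flattened to 16 numbers
landmark_idx = rng.choice(200, size=40, replace=False)
K_approx = nystrom_gram(paths, landmark_idx)           # built from 200*40 + 40*40 evaluations
```

The approximation is exact on the landmark block and only as good as the landmarks elsewhere, which is the sense in which landmark selection becomes a modelling choice.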

This is where the business case becomes sober. Kernel trading is attractive when the value of modelling memory exceeds the cost of maintaining the machinery.

What the paper directly shows, and what Cognaptus infers

The paper directly shows three things.

First, it derives a kernel/RKHS formulation for dynamic path-dependent mean-variance trading, including closed-form optimal weights expressed through PnL feature-map Gram matrices.

Second, it shows that path-dependent kernel and signature methods outperform Markovian baselines in controlled settings where temporal structure is present: synthetic path-dependent drift and real MSFT intraday bars with controlled synthetic signals.

Third, it demonstrates that the kernel framework’s practical performance depends heavily on regularisation, path scaling, spectral truncation, and compute/memory trade-offs.

Cognaptus infers a more operational point: this is not primarily a trading strategy paper. It is a signal-infrastructure paper.

A quant platform that produces signals should not only ask whether a signal predicts returns. It should ask how the signal decays, how its reliability changes with volatility, whether its path shape matters, and whether a current-state optimiser is throwing away useful information. Kernel trading provides one formal way to answer those questions and convert the answers into inventory.

That inference is strongest for firms with:

many related signals or assets;
non-instantaneous alpha decay;
order-flow, cross-impact, or lagged information effects;
research infrastructure for walk-forward validation;
enough compute to handle path kernels and Gram matrices;
a need for flexible nonlinear modelling without full deep-learning training loops.

It is weaker for firms whose constraints are dominated by transaction costs, market impact, latency, or extremely low signal-to-noise ratios. It is also weaker when interpretability and operational simplicity are worth more than marginal modelling flexibility. In finance, “more expressive” is not a synonym for “more profitable.” It is a synonym for “more ways to overfit before lunch.”

The missing production layer is not a footnote

The paper is careful about several limitations, but the business reader should make them explicit.

The experiments do not provide a full live trading study with transaction costs, spread, slippage, fees, queue position, market impact, borrow constraints, or risk limits. The MSFT experiment uses real prices but synthetic predictive signals. That is useful for isolating mechanism, but it does not establish commercial alpha.

The objective is mean-variance with terminal PnL risk. That is a clean research objective, but production trading often cares about drawdowns, turnover, capacity, intraday risk limits, execution constraints, and capital allocation across strategies. The paper mentions future extensions such as market impact and transaction costs, but those are not solved here.

The framework also assumes that historical trajectories provide a useful basis for online decisions. In non-stationary markets, the relevance of historical co-location paths can decay or shift. Nyström subsampling, recency-weighted landmarks, and validation windows can help, but they introduce additional modelling choices.

Finally, the method’s flexibility creates governance questions. Kernel features can be less interpretable than truncated signature terms. A risk committee may accept a Markowitz allocation even when it is mediocre because it can be explained. A kernel trader that compares today’s path to a cloud of historical trajectories may perform better, but “trust the RKHS” is not a control framework.

The actual lesson: memory is a modelling asset

The paper’s title could have been “Mean-Variance Optimisation Discovers Time Has Passed.” Less elegant, perhaps, but accurate.

The central lesson is that the optimiser should not be blind to the temporal structure of the information it uses. If alpha decays slowly, if drift is path-dependent, or if signal reliability changes with the path of volatility, then a snapshot strategy is deliberately underinformed. Kernel methods offer a flexible way to retain that memory while keeping an analytic solution.

But the paper is also a useful antidote to its own excitement. Kernel trading is not a universal upgrade over Markowitz or signature trading. It is a more general framework with stronger modelling capacity and heavier practical obligations. It can exploit memory when memory contains useful information. It can also overfit, destabilise, consume memory, and arrive late to the market with a very sophisticated explanation.

For a quant desk, the right takeaway is not to replace every Markovian strategy with a kernel trader. The right takeaway is to audit where the desk is pretending that current-state features are enough. When a signal’s history changes its value, the trading system should know that history.

The market remembers. The model might as well.

Cognaptus: Automate the Present, Incubate the Future.

¹ Owen Futter, Nicola Muça Cirone, and Blanka Horvath, “Kernel Learning for Mean-Variance Trading Strategies,” arXiv:2507.10701, 2025. https://arxiv.org/abs/2507.10701