Simulate First, Invest Later: How Diffusion Models Are Reinventing Portfolio Optimization

TL;DR for operators

Portfolio teams do not lack optimisation formulas. They lack enough relevant future scenarios. That is the problem this paper attacks.

The paper proposes a diffusion-based market simulator that learns from historical time-series data, then generates conditional future paths based on the current market state.¹ Those generated paths become the training environment for a reinforcement-learning portfolio agent. In plain terms: instead of asking an RL policy to learn from a thin archive of market history, the system first builds a synthetic scenario engine and lets the policy practise there. Sensible. Also dangerous, if the simulator hallucinates a market that conveniently rewards your model.

The important technical move is not “diffusion model beats Markowitz.” That would be the cheap headline, and the cheap headline is usually where portfolio research goes to become LinkedIn compost. The actual contribution is more careful: the authors adapt score-based diffusion to sequential data, prove error bounds under adapted Wasserstein distance, and then prove that dynamic mean-variance values are stable when the surrogate model is close under that metric.

For business use, the relevant pathway is:

Historical financial data is scarce relative to the dimensionality and regime variation of the decision problem.
A conditional diffusion model generates many state-aware future paths.
Those paths provide a scenario pool for portfolio optimisation or RL training.
Stability theory gives a reason—not a guarantee of profit, but a reason—to care about simulator accuracy.
Backtests on ten U.S. industry portfolios show promising risk-adjusted performance, especially for the generative TD3 strategy.

The boundary is equally important. The real-data experiment uses monthly excess returns for ten broad industry portfolios, not a live multi-asset institutional book with frictions, taxes, liquidity constraints, slippage, mandate rules, and career risk. The authors also retrain the RL agent only once in the 15-year evaluation window. So the paper is best read as a serious architecture for scenario-driven dynamic allocation, not as a product brochure for pressing “generate alpha.”

The old portfolio problem is not optimisation. It is the thinness of reality

A portfolio manager can solve a Markowitz optimisation before finishing coffee. The unpleasant part comes earlier: estimating the return distribution that goes into the optimisation.

Classical mean-variance allocation asks for expected returns, covariances, and a risk-aversion parameter. In real markets, those inputs behave like office politics: observable, influential, and never quite stable. Historical windows are short. Regimes shift. Estimation error dominates elegant algebra. A 60-month rolling covariance estimate may be mathematically well-behaved and still economically stale by the time the committee has finished approving it.
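A back-of-envelope calculation shows why estimation error dominates. Assuming i.i.d. returns with annualised volatility $\sigma$ (an illustrative simplification, not a claim from the paper), the standard error of the estimated annual mean after $n$ years of data is:

$$ \mathrm{SE}(\hat{\mu}) = \frac{\sigma}{\sqrt{n}}, \qquad \text{e.g. } \sigma = 0.15,\ n = 5 \ \Rightarrow\ \mathrm{SE}(\hat{\mu}) = \frac{0.15}{\sqrt{5}} \approx 0.067 $$

Under that assumption, a 60-month window pins down the mean return only to within roughly 6.7 percentage points per year (one standard error), which is the same order of magnitude as the annual returns being estimated.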

Dynamic portfolio selection makes the problem worse. The investor does not choose one allocation and vanish. They rebalance across time. Today’s action changes tomorrow’s wealth. Tomorrow’s information changes the action after that. The decision process is sequential, conditional, and non-anticipative: the strategy may use information available up to time $t$, but not information from time $t+1$ that has not yet happened.
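The non-anticipative constraint is easy to state in code. The following is a minimal sketch, not the paper's implementation: `run_policy` and `trailing_winner` are hypothetical names, and the trailing-mean rule is only a stand-in for a real policy. The point is that the weights for step `t` are computed from `returns[:t]` and nothing later.

```python
import numpy as np

def run_policy(returns, policy, initial_wealth=1.0):
    """Roll wealth forward; at step t the policy sees returns[:t] only."""
    wealth = [initial_wealth]
    for t in range(returns.shape[0]):
        weights = policy(returns[:t])          # decided before returns[t] is known
        wealth.append(wealth[-1] * (1.0 + weights @ returns[t]))
    return np.array(wealth)

def trailing_winner(history):
    """Hypothetical rule: hold the asset with the best trailing mean return."""
    n_assets = history.shape[1]
    if history.shape[0] == 0:                  # no information yet: equal weight
        return np.full(n_assets, 1.0 / n_assets)
    weights = np.zeros(n_assets)
    weights[np.argmax(history.mean(axis=0))] = 1.0
    return weights

rng = np.random.default_rng(0)
returns = rng.normal(0.01, 0.05, size=(12, 3))   # 12 periods, 3 assets (toy data)
wealth_path = run_policy(returns, trailing_winner)
```

A quick sanity check for any backtest loop of this kind: perturb the returns from some date onward and confirm that wealth up to that date does not change.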

This is why the paper’s mechanism matters. It does not simply train a generator to produce market-looking samples. It trains a generator that can condition on the path so far and then produce the next step in a way that respects the temporal structure of the decision problem.

That is a higher bar than “make synthetic returns with plausible volatility.” A spreadsheet can do that. Badly, but confidently.

The mechanism: turn scarce history into a conditional scenario engine

The paper’s workflow is easiest to understand as a chain:

Stage	What happens	Why it matters operationally
Historical market paths	The model starts with observed time-series data from the real but unknown market model $P$.	The real distribution is not specified parametrically. The system is model-free in that sense.
Adaptive diffusion training	A score network and RNN encoder are trained to learn conditional time-series dynamics.	The RNN compresses the path history so the diffusion model can condition on current market state.
Conditional scenario generation	The sampler generates future paths step by step from a surrogate model $Q$.	Portfolio training receives many scenario paths rather than one thin historical path.
Dynamic portfolio learning	A policy-gradient / TD3-style agent trains inside the generated environment.	The portfolio policy can learn sequential allocation rules from synthetic but state-conditioned scenarios.
Evaluation	Generated-policy performance is tested on synthetic and real market data.	The backtest asks whether the mechanism survives contact with observed returns.

The first technical issue is that ordinary diffusion models are usually built for static data objects. An image generator does not need to worry about whether the top-left pixel was “known” before the lower-right pixel. A dynamic portfolio policy does. Time ordering is not decoration; it is part of the problem.

The authors therefore use the adapted Wasserstein metric, denoted $AW_2$, rather than ordinary Wasserstein distance. Ordinary Wasserstein distance compares distributions as if paths were just points in a large Euclidean space. Adapted Wasserstein distance is designed for stochastic processes where the coupling must respect filtration structure—roughly, the information revealed over time. This distinction is not just mathematical taste. Prior results show that dynamic value functions can be unstable under standard probabilistic metrics, so using the wrong metric can make simulator closeness irrelevant to decision quality.
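A standard two-period illustration from the adapted optimal transport literature (not taken from the paper) shows how far apart the two metrics can be. Let $P$ be the law of a path whose first step is $\pm\varepsilon$ with equal probability and whose second step is $+1$ after $+\varepsilon$ and $-1$ after $-\varepsilon$. Let $Q$ be the law of a path whose first step is $0$ and whose second step is $\pm 1$ with equal probability. Assuming squared Euclidean cost summed over the two steps:

$$ W_2(P, Q) = \varepsilon, \qquad AW_2(P, Q) = \sqrt{\varepsilon^2 + 2} $$

As $\varepsilon \to 0$ the ordinary distance vanishes, yet under $P$ the second step is fully known after the first, while under $Q$ it is a coin flip. An investor who can rebalance after the first step faces two very different problems, and only the adapted distance registers that.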

The paper’s adaptive sampler generates a trajectory sequentially. It first samples the initial point from noise through a reverse diffusion process. Then, to generate the next time step, it conditions on the already generated path and runs a conditional reverse diffusion step. Repeat until a full path exists.

That is the simulator version of not cheating.
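The sequential loop can be sketched on a toy problem. The sketch assumes a one-dimensional process whose true next-step law is Gaussian, $x_{t+1} \mid x_t \sim N(a x_t, s^2)$, so the conditional score is available in closed form and stands in for the trained score network. The variance-preserving noise schedule, the Euler-Maruyama discretisation, and all names (`toy_score`, `sample_next`, `sample_paths`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

A, S = 0.8, 0.5           # toy dynamics: x_next | x_prev ~ N(A * x_prev, S**2)
BETA, T_DIFF = 1.0, 5.0   # assumed constant noise rate and diffusion horizon

def toy_score(x, tau, x_prev):
    """Closed-form conditional score of the noised next-step law at diffusion time tau."""
    alpha = np.exp(-0.5 * BETA * tau)
    var = alpha**2 * S**2 + 1.0 - alpha**2
    return -(x - alpha * A * x_prev) / var

def sample_next(x_prev, rng, n_steps=500):
    """Reverse-time SDE (Euler-Maruyama), conditioned on the previous value."""
    x_prev = np.atleast_1d(np.asarray(x_prev, dtype=float))
    dt = T_DIFF / n_steps
    x = rng.standard_normal(x_prev.shape)          # start from pure noise
    for k in range(n_steps):
        tau = T_DIFF - k * dt                      # diffusion time runs backward
        drift = 0.5 * BETA * x + BETA * toy_score(x, tau, x_prev)
        x = x + drift * dt + np.sqrt(BETA * dt) * rng.standard_normal(x.shape)
    return x

def sample_paths(x0, horizon, rng):
    """Build paths one step at a time; each step conditions on the one before."""
    path = [np.atleast_1d(np.asarray(x0, dtype=float))]
    for _ in range(horizon):
        path.append(sample_next(path[-1], rng))
    return np.stack(path, axis=-1)                 # shape: (n_paths, horizon + 1)

rng = np.random.default_rng(0)
paths = sample_paths(np.zeros(8), horizon=3, rng=rng)
```

In this toy the condition is just the last value. In the paper's setting the condition is an encoding of the whole generated path so far, and the score comes from a trained network rather than a formula.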

The score network is the engine; the RNN is the memory

In the implementable version, the model contains two neural components:

Component	Role in the mechanism
Score network $s_\theta$	Learns the score function used by the reverse diffusion process to denoise samples into plausible market states.
RNN encoder $R_\theta$	Compresses the historical path into a feature representation used as conditional input.
Policy network $\pi_\beta$	Maps encoded market state into portfolio weights during the TD3-style allocation procedure.

The RNN matters because conditioning directly on an expanding path creates a dimensionality problem. If every past return vector becomes an input, the conditioning object grows as the time horizon lengthens. The paper’s practical fix is to encode the path into hidden variables. The score network and RNN encoder are trained jointly, so the simulator learns not just “what returns look like,” but what compressed state representation is useful for conditional generation.
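The compression idea can be shown with a bare recurrent cell. This is a minimal sketch under stated assumptions: `TinyRNNEncoder` is a hypothetical Elman-style cell with random, untrained weights, whereas the paper's encoder is trained jointly with the score network and its architecture may differ. What the sketch shows is only that the conditioning vector keeps a fixed size however long the path becomes.

```python
import numpy as np

class TinyRNNEncoder:
    """Elman-style recurrence: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""

    def __init__(self, n_assets, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Random weights for illustration only; a real encoder would be trained.
        self.W_h = rng.normal(0.0, 0.1, size=(hidden_dim, hidden_dim))
        self.W_x = rng.normal(0.0, 0.1, size=(hidden_dim, n_assets))
        self.b = np.zeros(hidden_dim)

    def encode(self, path):
        """Fold a (time, n_assets) path into one hidden vector."""
        h = np.zeros(self.b.shape[0])
        for x_t in path:
            h = np.tanh(self.W_h @ h + self.W_x @ x_t + self.b)
        return h

encoder = TinyRNNEncoder(n_assets=10, hidden_dim=16)
rng = np.random.default_rng(1)
short_state = encoder.encode(rng.normal(0.0, 0.05, size=(12, 10)))
long_state = encoder.encode(rng.normal(0.0, 0.05, size=(120, 10)))
```

Both `short_state` and `long_state` have 16 entries, so a downstream score network can take the same input shape after one year of history or ten.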

This is a useful business distinction. A static generator gives you more samples. A conditional generator gives you samples relevant to now.

That word “relevant” is where investment value might exist. A scenario set generated after a crisis window should not look identical to one generated after a low-volatility expansion. If it does, it is not a scenario engine; it is a very expensive screensaver.

The theory says simulator error should matter for portfolio value

The paper’s theoretical argument has two connected parts.

First, the adaptive diffusion sampler can approximate the real time-series distribution $P$ with a surrogate distribution $Q$ under adapted Wasserstein distance, provided the score-matching error is small and the technical assumptions hold. The paper states the core result as an $AW_2$ error bound between $P$ and the generated path distribution $Q_T$, with the bound controlled by score-matching error and diffusion-time approximation terms.

Second, the dynamic mean-variance portfolio problem is stable under that same metric. In simplified language:

$$ |v^\ast(P) - v^\ast(Q)| \leq C \, AW_2(P, Q) $$

when the relevant assumptions are satisfied.

This matters because it connects generator quality to portfolio decision quality. Without that bridge, synthetic scenarios are just numerically convenient fiction. With the bridge, improving the simulator has a theoretically meaningful relationship to the value of the allocation problem.

The authors also use a duality argument. Dynamic mean-variance optimisation is time-inconsistent, which means the usual Bellman machinery does not apply cleanly to the original objective. The paper works through a quadratic hedging dual problem, derives a dynamic-programming structure there, and then uses it to justify the policy-gradient approach.
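A generic identity shows why a quadratic auxiliary problem helps. The objective is written here in an assumed form, $\mathbb{E}[W_T] - \tfrac{\gamma}{2}\mathrm{Var}(W_T)$ for terminal wealth $W_T^\pi$ under policy $\pi$, which may differ from the paper's normalisation. The variance is itself a minimum over a constant target $c$:

$$ \mathrm{Var}(W_T) = \min_{c \in \mathbb{R}} \mathbb{E}\big[(W_T - c)^2\big] $$

Substituting and swapping the two maximisations gives:

$$ \sup_{\pi} \Big\{ \mathbb{E}[W_T^\pi] - \tfrac{\gamma}{2}\, \mathrm{Var}(W_T^\pi) \Big\} = \sup_{c \in \mathbb{R}} \, \sup_{\pi} \, \mathbb{E}\Big[ W_T^\pi - \tfrac{\gamma}{2} \big(W_T^\pi - c\big)^2 \Big] $$

For each fixed $c$ the inner problem is the expectation of a terminal payoff, so the tower property and dynamic programming apply. The time-inconsistency is confined to the outer one-dimensional search over $c$. The paper's quadratic hedging dual is more refined than this; the identity is only meant to convey the flavour of the argument.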

For operators, the exact proof machinery is less important than the governance implication: the simulator cannot be evaluated only by whether its generated returns “look realistic.” It must be evaluated by whether decision values are stable when moving from real to generated distributions. Pretty samples are not enough. Markets have enough aesthetic fraud already.

The policy agent trains in a generated market, not in the archive

The reinforcement-learning part is not treated as an ordinary trading game with endless replay. The authors explicitly address a practical problem: long historical archives may cover many regimes but mix incompatible market conditions; short recent windows are more relevant but too small for effective RL training.

Their solution is to train the agent on a scenario pool generated from the current context. The diffusion model produces many one-year paths conditioned on recent data. The TD3-style agent uses those paths as its environment and learns allocation actions. In the real-data experiment, the trained GenTD3 agent is used for the first 7.5 years, then retrained once at the midpoint using a fresh one-year context window and another set of generated paths.

That retraining choice is not incidental. It acknowledges that a scenario generator is not a fossil. It should be refreshed when the market state changes. The authors note that annual retraining would be more aligned with the one-year test horizon, but they retrain only once because of limited computational resources.

This is one of the paper’s most practical details. In production, the retraining calendar would become a risk-control variable. Too slow, and the model fossilises. Too fast, and the system may chase noise with a GPU bill attached.
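Treating the retraining calendar as a parameter can be sketched as a walk-forward loop. This is a minimal sketch, assuming monthly data, a 12-month context window, and a 180-month evaluation period as described above. The `generate` and `train` callables are placeholders for the diffusion sampler and the TD3 training routine, and every name here is hypothetical.

```python
def retrain_months(n_eval_months, retrain_every):
    """0-based evaluation months at which the agent is (re)trained."""
    return list(range(0, n_eval_months, retrain_every))

def walk_forward(returns, context_len, retrain_every, generate, train):
    """Act each month; refresh the policy on the chosen schedule."""
    n_eval = len(returns) - context_len
    schedule = set(retrain_months(n_eval, retrain_every))
    policy, actions = None, []
    for m in range(n_eval):
        t = context_len + m
        context = returns[t - context_len:t]       # most recent window, no look-ahead
        if m in schedule:
            policy = train(generate(context))      # new scenario pool, new policy
        actions.append(policy(context))
    return actions

train_calls = []

def generate(context):
    """Placeholder for the conditional sampler: returns a one-path pool."""
    return [list(context)]

def train(pool):
    """Placeholder for RL training: records the call, returns a trivial policy."""
    train_calls.append(len(pool))
    return lambda context: "hold"

monthly_returns = [0.0] * 192                      # 12 context months + 180 evaluation months
actions = walk_forward(monthly_returns, 12, 90, generate, train)
```

With `retrain_every=90` the policy is trained at month 0 and once more at the midpoint, matching the paper's single retrain. Setting it to 12 would give the annual cadence the authors describe as better aligned with the one-year horizon.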

The synthetic experiment checks the RL machinery, not the full thesis

The synthetic experiment has a narrow purpose: it tests whether the policy-gradient component can learn a useful allocation rule when the scenario pool is sampled from a known distribution.

The setup uses ten assets with normally distributed returns. Short selling and borrowing are prohibited, so portfolio weights sit in the simplex. The authors compare equal weight, static Markowitz with knowledge of the true model, and TD3 trained via the proposed algorithm. This is not yet a test of diffusion generation. The scenario pool is sampled from the known return model. It is a check on whether the RL machinery can use a large scenario pool effectively.
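The two non-RL baselines in this setup can be sketched directly. The code assumes a single-period objective $\mu^\top w - \tfrac{\gamma}{2} w^\top \Sigma w$ and made-up parameters. The solver is a plain projected gradient ascent with a sort-based simplex projection, chosen here for brevity rather than because the paper uses it.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1} (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def mv_value(w, mu, cov, gamma):
    """Mean-variance objective: expected return minus a variance penalty."""
    return w @ mu - 0.5 * gamma * (w @ cov @ w)

def long_only_markowitz(mu, cov, gamma, n_iter=2000):
    """Projected gradient ascent from equal weights, step = 1 / Lipschitz constant."""
    step = 1.0 / (gamma * np.linalg.eigvalsh(cov)[-1])
    w = np.full(mu.size, 1.0 / mu.size)
    for _ in range(n_iter):
        w = project_to_simplex(w + step * (mu - gamma * (cov @ w)))
    return w

# Made-up parameters for ten assets: not the paper's values.
n = 10
mu = np.linspace(0.05, 0.15, n)
cov = 0.04 * (0.3 * np.ones((n, n)) + 0.7 * np.eye(n))
gamma = 3.0

w_equal = np.full(n, 1.0 / n)
w_markowitz = long_only_markowitz(mu, cov, gamma)
```

Comparing `mv_value(w_markowitz, mu, cov, gamma)` with `mv_value(w_equal, mu, cov, gamma)` reproduces the kind of comparison in the table, with the RL agent left out.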

The result is encouraging but modest in the correct way. After 10,000 training steps, TD3 reports annual return of 13.01%, volatility of 0.23, Sharpe of 0.61, and mean-variance value of 0.09. Equal weight reports 12.96%, volatility 0.25, Sharpe 0.51, and mean-variance value 0.07. Static Markowitz reports 13.92%, volatility 0.24, Sharpe 0.57, and mean-variance value 0.09.

Strategy	Return	Volatility	Sharpe	Mean-variance value
Equal Weight	12.96%	0.25	0.51	0.07
Static Markowitz	13.92%	0.24	0.57	0.09
TD3	13.01%	0.23	0.61	0.09

The paper’s Figure 3 is best read as a training diagnostic: it shows performance after each training epoch for return, Sharpe, and mean-variance value. Figure 4 is illustrative: it shows one testing scenario, including portfolio weights and wealth trajectory. These figures support feasibility of the RL component, not the claim that diffusion-generated markets are sufficient for real deployment.

That distinction matters. The synthetic test says the agent can learn from scenarios. It does not yet say the diffusion model can generate the right scenarios.

The real-data experiment is the main evidence

The real-data experiment is where the full mechanism is assembled.

The authors use value-weighted ten-industry portfolios from the Kenneth R. French Data Library. The monthly excess-return series runs from July 1926 to March 2025. The test dataset covers April 2009 to March 2025; its first year serves as the historical context window, leaving a 15-year evaluation period from April 2010 to March 2025. The validation dataset covers August 1992 to March 2009, and the remaining earlier data is used to train the diffusion model.

The strategies compared are:

Strategy	What it does
S&P 500	Market benchmark.
Equal Weight	Allocates equally across the ten industry portfolios.
HistMarkowitz	Monthly constrained Markowitz using sample mean and covariance from the most recent 60 months.
GenMarkowitz	Uses the diffusion model to generate 500 one-month predictions from the recent one-year context, then solves constrained Markowitz.
GenTD3	Uses diffusion-generated one-year paths as the scenario pool for a TD3 policy agent.

At risk aversion $\gamma = 3$, both generative strategies beat S&P 500, equal weight, and historical Markowitz on Sharpe and Sortino. GenMarkowitz has the highest return, but GenTD3 has the strongest Sharpe, Sortino, and lowest volatility among the listed strategies.

Strategy	Return	Volatility	Sharpe	Sortino	Max Drawdown	Calmar
S&P 500	10.98%	0.1460	0.7522	0.6955	-0.2477	0.4433
Equal Weight	13.26%	0.1502	0.8831	0.8310	-0.2293	0.5783
HistMarkowitz	8.08%	0.1553	0.5200	0.5226	-0.3014	0.2679
GenMarkowitz	13.59%	0.1536	0.8850	0.8526	-0.2690	0.5053
GenTD3	12.80%	0.1398	0.9158	0.8966	-0.2325	0.5507

The interpretation is not “RL makes more money.” At $\gamma = 3$, GenTD3 returns less than equal weight and GenMarkowitz. Its advantage is risk-adjusted: lower volatility and better Sharpe/Sortino.
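For readers who want to recompute such a table from a return series, here is a minimal sketch of the usual definitions. The conventions are assumptions: arithmetic annualisation, downside deviation measured against zero, and drawdown on compounded wealth. The paper may use different ones. The reported figures are at least consistent with Sharpe as return over volatility and Calmar as return over absolute maximum drawdown; for the S&P 500 row, 0.1098 / 0.1460 ≈ 0.752 and 0.1098 / 0.2477 ≈ 0.443.

```python
import numpy as np

def performance(monthly_excess, periods=12):
    """Annualised summary statistics from periodic excess returns."""
    r = np.asarray(monthly_excess, dtype=float)
    ann_return = r.mean() * periods                       # arithmetic annualisation
    ann_vol = r.std(ddof=1) * np.sqrt(periods)
    # Downside deviation: root mean square of the negative returns only.
    downside = np.sqrt(np.mean(np.minimum(r, 0.0) ** 2) * periods)
    wealth = np.concatenate([[1.0], np.cumprod(1.0 + r)])
    max_dd = (wealth / np.maximum.accumulate(wealth) - 1.0).min()
    return {
        "return": ann_return,
        "volatility": ann_vol,
        "sharpe": ann_return / ann_vol,
        "sortino": ann_return / downside if downside > 0 else float("inf"),
        "max_drawdown": max_dd,
        "calmar": ann_return / abs(max_dd) if max_dd < 0 else float("inf"),
    }

stats = performance([0.10, -0.50, 0.20])   # toy series with a 50% drawdown
```

Because excess returns are the input, no separate risk-free rate is subtracted in the Sharpe ratio.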

That is a more interesting claim. Portfolio institutions usually do not buy raw return claims from academic backtests, because raw return claims age like unrefrigerated sushi. A strategy that improves risk-adjusted behaviour by conditioning on current scenario structure is at least pointing at the right problem.

Risk aversion is not a footnote; it changes the conclusion

The robustness test varies the risk-aversion parameter $\gamma$. This is not a cosmetic sensitivity check. It matters because $\gamma$ controls the mean-variance trade-off, and in practice nobody knows the “correct” risk aversion. Institutions reverse-engineer it from mandate constraints, drawdown tolerance, board politics, and whatever happened last quarter.
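In the textbook unconstrained single-period case, the role of $\gamma$ is explicit. This is a simplification, since the paper's experiments use constrained and dynamic versions. Maximising $\mu^\top w - \tfrac{\gamma}{2} w^\top \Sigma w$ over $w$ gives:

$$ w^\ast = \frac{1}{\gamma}\, \Sigma^{-1} \mu $$

Halving $\gamma$ doubles every risky position. Under long-only, fully invested constraints the weights cannot scale freely, and a lower $\gamma$ instead tilts the portfolio toward concentrated positions in the assets with the highest estimated means.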

At lower risk aversion, $\gamma = 0.5$, GenTD3 looks strongest:

Strategy	Return	Volatility	Sharpe	Sortino	Max Drawdown	Calmar
S&P 500	10.98%	0.1460	0.7522	0.6955	-0.2477	0.4433
Equal Weight	13.26%	0.1502	0.8831	0.8310	-0.2293	0.5783
HistMarkowitz	6.18%	0.2043	0.3025	0.2973	-0.4090	0.1511
GenMarkowitz	12.02%	0.2418	0.4970	0.5121	-0.5563	0.2160
GenTD3	13.57%	0.1440	0.9428	0.9774	-0.2199	0.6405

At higher risk aversion, $\gamma = 5$, GenMarkowitz becomes the standout:

Strategy	Return	Volatility	Sharpe	Sortino	Max Drawdown	Calmar
S&P 500	10.98%	0.1460	0.7522	0.6955	-0.2477	0.4433
Equal Weight	13.26%	0.1502	0.8831	0.8310	-0.2293	0.5783
HistMarkowitz	9.26%	0.1375	0.6737	0.6909	-0.2605	0.3555
GenMarkowitz	14.17%	0.1360	1.0418	1.0543	-0.1838	0.7711
GenTD3	12.73%	0.1392	0.9150	0.8916	-0.2326	0.5474

This robustness section is doing real interpretive work. It shows that the diffusion generator is not attached to one downstream optimiser. It can feed a static Markowitz-style optimiser or a dynamic TD3 agent. But it also shows that the downstream choice and risk-aversion setting matter materially.

The GenMarkowitz strategy is sensitive to $\gamma$: weak at $\gamma = 0.5$, excellent at $\gamma = 5$. GenTD3 is more stable across the tested values, though not always the winner. That suggests a practical division of labour:

Use case	More natural fit
Need a familiar optimiser with generated inputs	GenMarkowitz
Need a sequential policy that reacts to encoded state	GenTD3
Need robustness across uncertain risk-aversion settings	GenTD3 looks more stable in these tests
Need highest result in this specific high-$\gamma$ table	GenMarkowitz

The business lesson is not “pick GenTD3.” It is “separate the scenario engine from the decision engine.” Once the conditional generator exists, firms can test multiple allocators on the same generated state-conditioned future paths.

That is operationally useful. It turns model development into a modular process rather than a single heroic end-to-end black box.
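The separation can be expressed as a very small interface. This is a minimal sketch with hypothetical names: an allocator is any function from the path history so far to a list of weights, and the pool is whatever the scenario engine produced. Here the pool is two hand-written paths; in practice it would come from the conditional generator.

```python
from statistics import mean, pstdev

def terminal_wealth(path, allocator):
    """Compound wealth along one path; weights at step t use path[:t] only."""
    wealth = 1.0
    for t, step_returns in enumerate(path):
        weights = allocator(path[:t])
        wealth *= 1.0 + sum(w * r for w, r in zip(weights, step_returns))
    return wealth

def compare(allocators, pool):
    """Evaluate every allocator on the same scenario pool."""
    report = {}
    for name, allocator in allocators.items():
        outcomes = [terminal_wealth(path, allocator) for path in pool]
        report[name] = {"mean": mean(outcomes), "stdev": pstdev(outcomes)}
    return report

# Two hand-written two-asset paths standing in for generated scenarios.
pool = [
    [[0.10, 0.00], [0.00, 0.10]],
    [[0.00, 0.00], [0.00, 0.00]],
]
allocators = {
    "equal_weight": lambda history: [0.5, 0.5],
    "first_asset": lambda history: [1.0, 0.0],
}
report = compare(allocators, pool)
```

Swapping in a Markowitz solver or a trained policy only requires another entry in `allocators`; the pool and the evaluation code stay the same.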

What the paper directly shows

The paper directly supports four claims.

First, adaptive conditional diffusion can be formulated for time-series generation with theoretical error bounds under adapted Wasserstein distance. This is the core mathematical contribution.

Second, dynamic mean-variance value functions are stable when the surrogate model is close to the real model under that metric. This is the decision-theoretic bridge.

Third, a policy-gradient / TD3-style agent can train on generated or known scenario pools and produce competitive allocation policies in the paper’s experiments.

Fourth, on the ten-industry real-data backtest, diffusion-based strategies outperform the selected benchmarks on important risk-adjusted metrics, with the exact winner depending on the risk-aversion parameter and metric.

That is already enough. There is no need to inflate it into “diffusion models reinvent asset management.” They may reinvent part of the portfolio research stack. Asset management itself remains stubbornly attached to clients, constraints, and being blamed for drawdowns.

What Cognaptus infers for business use

The strongest business implication is that diffusion models may be more valuable as scenario engines than as return predictors.

A return predictor gives you a number. A scenario engine gives you a conditional distribution of possible futures. That distribution can feed stress testing, allocation training, model comparison, drawdown planning, and governance review. It is the difference between “we forecast 7%” and “here are 500 plausible state-conditioned paths; here is how each candidate allocation behaves.”

For an asset manager, wealth platform, or quant research team, the architecture suggests a practical stack:

Layer	Business function
Data layer	Curate recent and long historical return windows; include macro or factor variables if validated.
Conditional generator	Produce market scenarios conditional on current state.
Strategy layer	Test Markowitz, RL, rule-based, or constrained optimiser policies on the same scenario pool.
Risk layer	Evaluate drawdowns, volatility, turnover, tail behaviour, and stress-path behaviour.
Governance layer	Compare simulator regimes, retraining cadence, and sensitivity to $\gamma$ or mandate constraints.

The likely ROI is not “higher returns next month.” It is cheaper and richer experimentation under market uncertainty. A firm can stress candidate allocation rules across many generated paths instead of overfitting to one realised path and then pretending the backtest was destiny. Charming tradition, but not a risk process.

What remains uncertain

The main limitations are specific and consequential.

First, the theoretical results rely on technical assumptions about data distributions, score approximation, boundedness, and dissipativity. The authors themselves identify relaxing Assumption 2 and improving the error rates as future work. This does not invalidate the theory, but it narrows the conditions under which the clean guarantees can be interpreted literally.

Second, the real-data test is based on monthly excess returns for ten industry portfolios. That is a useful benchmark universe, but it is not the same as a high-dimensional multi-asset portfolio with daily trading, transaction costs, taxes, liquidity limits, leverage rules, benchmark-relative constraints, and human oversight.

Third, the GenTD3 strategy is retrained only once during the evaluation period. The authors note that more frequent retraining would likely be preferable. In production, retraining frequency would need to be tested as a first-class design choice, not left as an implementation footnote.

Fourth, the paper does not establish live-trading performance. It reports backtests. Good backtests are useful. Bad backtests are entertainment. This one is useful, but still a backtest.

Fifth, the framework’s value depends on simulator fidelity in precisely the regimes investors care about: crises, transitions, liquidity squeezes, and correlation breakdowns. Monthly ten-industry data can only tell part of that story.

The real contribution is a better rehearsal space

The cleanest way to understand the paper is not as a new optimiser, but as a better rehearsal space for portfolio decisions.

The diffusion model creates conditional market scenarios. The adapted Wasserstein theory explains why time-series structure and non-anticipation matter. The stability result connects scenario accuracy to mean-variance value. The RL agent then practises inside that generated environment.

This is a sensible direction for AI in finance because it avoids the most juvenile version of the field: asking a model to say whether the market goes up or down, then acting surprised when reality declines the invitation.

The more mature question is: can we build state-conditioned simulators that let strategies be trained, compared, and stress-tested before capital is exposed?

This paper’s answer is: under meaningful assumptions, with promising empirical evidence, yes—at least enough to deserve attention. Not enough to fire the risk committee. Unfortunately for everyone who enjoys clean org charts, the risk committee lives.

Cognaptus: Automate the Present, Incubate the Future.

¹ Ahmad Aghapour, Erhan Bayraktar, and Fengyi Yuan, “Solving dynamic portfolio selection problems via score-based diffusion models,” arXiv:2507.09916, 2025, https://arxiv.org/abs/2507.09916.
