Residual Learning: How Reinforcement Learning Is Speeding Up Portfolio Math

TL;DR for operators

Financial AI is usually sold as a machine that predicts markets. This paper is about something more modest and, frankly, more useful: making the maths underneath portfolio optimisation and option pricing run faster.

The authors propose a reinforcement learning controller that adjusts the block size of a preconditioner inside Flexible GMRES, an iterative solver used for large sparse or awkward linear systems. The agent is trained with PPO. Its state is the current residual vector, its action is a choice of block size, and its reward pushes the residual norm downward. In plain English: the model watches how badly the solver is still missing the answer, then changes the way the solver reorganises the problem.

The paper’s evidence is convergence evidence. On real-world portfolio matrices of sizes 4,008, 16,955, and 33,833, the PPO-guided method converges faster than a constant block-size preconditioner. On synthetic option-pricing systems of sizes 1,000 and 2,000, with densities 0.01 and 0.05, the PPO curves reach low residual levels in fewer iterations, sometimes in as few as two iterations. That is not the same as proving better trading performance, better option valuation models, or cheaper production deployment. Nobody has discovered Sharpe ratio in a preconditioner. Calm down.

The useful business interpretation is narrower and stronger: for firms repeatedly solving large $Ax=b$ systems in portfolio construction, risk engines, scenario analysis, or numerical derivatives pricing, adaptive solver tuning could reduce latency and compute waste. The adoption question is not “does RL understand finance?” It is “can an RL controller reliably reduce solve time on the matrices your systems actually generate, after training cost, stability constraints, and comparison against mature numerical methods are counted?”

The finance problem quietly becomes $Ax=b$

The familiar version of portfolio optimisation sounds like finance: choose asset weights, target a return, control variance, respect constraints. The operational version often sounds like numerical linear algebra. Mean-variance optimisation, once written through its first-order conditions, becomes a system of equations. In the paper’s formulation, the Karush-Kuhn-Tucker conditions produce a coefficient matrix containing the covariance matrix, the all-ones constraint vector, and the expected-return vector. Solving for the portfolio weights and multipliers becomes solving $Ay=b$.¹
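To make the step from finance to linear algebra concrete, here is a minimal sketch of how such a KKT system can be assembled. It assumes the textbook problem of minimising $\tfrac{1}{2}w^\top \Sigma w$ subject to a budget constraint and a target-return constraint; the paper's exact sign conventions and constraint set may differ. The function name `kkt_system` and all numbers are hypothetical.

```python
import numpy as np

def kkt_system(cov, mu, target_return):
    """Assemble A y = b for: min 0.5 * w' cov w  s.t.  1'w = 1,  mu'w = target.

    The unknown y stacks the weights w and the two Lagrange multipliers.
    This is an assumed textbook form, not necessarily the paper's layout.
    """
    n = len(mu)
    ones = np.ones(n)
    A = np.zeros((n + 2, n + 2))
    A[:n, :n] = cov          # stationarity: cov w + l1 * 1 + l2 * mu = 0
    A[:n, n] = ones
    A[:n, n + 1] = mu
    A[n, :n] = ones          # budget constraint: weights sum to one
    A[n + 1, :n] = mu        # return constraint: mu'w equals the target
    b = np.zeros(n + 2)
    b[n] = 1.0
    b[n + 1] = target_return
    return A, b

# Hypothetical three-asset example.
cov = np.array([[0.04, 0.006, 0.0],
                [0.006, 0.09, 0.01],
                [0.0, 0.01, 0.16]])
mu = np.array([0.05, 0.08, 0.12])
A, b = kkt_system(cov, mu, target_return=0.08)
y = np.linalg.solve(A, b)
weights, multipliers = y[:3], y[3:]
```

At three assets a direct solve is trivial. The paper's concern is what happens when the same structure has tens of thousands of rows.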

Option pricing follows a different road to the same neighbourhood. Start with a Black-Scholes-style partial differential equation, discretise time and the asset-price grid with finite differences, and the continuous pricing equation becomes a matrix equation at each time step. The resulting matrix in the paper’s one-dimensional setup is tridiagonal, with coefficients shaped by volatility, the risk-free rate, the time step, and the asset-price grid.
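As an illustration of where the tridiagonal matrix comes from, the sketch below builds the coefficient matrix for one fully implicit finite-difference step of the Black-Scholes equation on a uniform grid $S_i = i\,\Delta S$. The fully implicit scheme, the grid, and the name `implicit_bs_matrix` are assumptions made for the example; the paper's discretisation may differ in detail. Boundary terms are assumed to be moved to the right-hand side.

```python
import numpy as np
from scipy.sparse import diags

def implicit_bs_matrix(n_grid, sigma, r, dt):
    """Tridiagonal matrix M with M @ V_now = V_later for interior nodes.

    Assumes a uniform price grid S_i = i * dS (i = 1..n_grid) and a fully
    implicit time step; boundary contributions belong on the right-hand side.
    """
    i = np.arange(1, n_grid + 1, dtype=float)
    lower = 0.5 * dt * (r * i - sigma**2 * i**2)    # multiplies V_{i-1}
    main = 1.0 + dt * (sigma**2 * i**2 + r)         # multiplies V_i
    upper = -0.5 * dt * (r * i + sigma**2 * i**2)   # multiplies V_{i+1}
    return diags([lower[1:], main, upper[:-1]], offsets=[-1, 0, 1], format="csr")

# Volatility and rate chosen inside the ranges the paper reports (5%-25%, 1%).
n_grid, sigma, r, dt = 1000, 0.2, 0.01, 1.0 / 252
M = implicit_bs_matrix(n_grid, sigma, r, dt)
row_sums = np.asarray(M.sum(axis=1)).ravel()
```

Each interior row sums to $1 + r\,\Delta t$, which is a quick sanity check on the coefficients. The sub- and super-diagonals differ, so the matrix is non-symmetric.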

So the paper is not really asking whether reinforcement learning can become a portfolio manager. It is asking whether reinforcement learning can help an iterative numerical method reach a small residual faster.

That distinction matters. In production finance, the model is often not the only bottleneck. Large portfolios, multi-asset products, fine pricing grids, scenario sweeps, and repeated recalibration create many linear systems. Direct methods become expensive as dimensions grow: a dense factorisation costs on the order of $n^3$ operations for an $n \times n$ system. Iterative solvers become attractive because they avoid full inversion, but they bring their own problem: convergence may be painfully slow when the system is ill-conditioned, non-symmetric, sparse, or structurally uneven.

This is where the paper places the AI. Not in the investment committee. In the solver room, wearing a high-visibility vest and trying to stop the residual from loitering.

Preconditioning is the trick before the trick

Iterative solvers work by gradually improving an approximate answer. The residual,

$$ r=b-Ax, $$

measures what is still unexplained by the current solution. Smaller residual, better approximation. The basic ambition is simple: reduce the residual quickly enough that the solver becomes usable inside a real workflow rather than a decorative academic appliance.
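In code, the residual is a one-liner. The small sketch below also normalises by $\|b\|$, a common convention for stopping tests; that normalisation and the name `relative_residual` are choices made for this example rather than something the paper specifies.

```python
import numpy as np

def relative_residual(A, b, x):
    """Return ||b - A x|| / ||b|| for a candidate solution x."""
    r = b - A @ x
    return np.linalg.norm(r) / np.linalg.norm(b)

# Hypothetical 2x2 system.
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
b = np.array([1.0, 2.0])

start = relative_residual(A, b, np.zeros(2))             # zero guess explains nothing
solved = relative_residual(A, b, np.linalg.solve(A, b))  # exact solve, rounding only
```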

Preconditioning changes the system into an easier equivalent or approximate form. Instead of attacking the original matrix directly, the solver uses another matrix that improves the numerical behaviour of the problem. A good preconditioner can compress the number of iterations required. A bad or badly tuned one can add overhead without enough convergence benefit. Mathematics, as usual, gives with one hand and invoices with the other.

The authors use a block preconditioner. The matrix $A$ is partitioned into smaller blocks, and QR decomposition is applied inside those blocks. The block-preconditioned structure is then used inside Flexible GMRES. The “flexible” part is important because the preconditioner can change during the iteration process. Standard GMRES assumes the same preconditioner is applied at every iteration; FGMRES stores the preconditioned vectors as it goes, at the cost of extra memory, so the preconditioner is allowed to differ from one iteration to the next.
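One plausible reading of a block QR preconditioner is sketched here: take the diagonal blocks of $A$, factorise each with QR, and apply the preconditioner by solving block by block. Treating only the diagonal blocks is an assumption of this sketch, as are the names `build_block_qr` and `apply_block_qr`; the paper's construction may use the blocks differently.

```python
import numpy as np

def build_block_qr(A, block_size):
    """QR-factorise each diagonal block of A (off-block entries are ignored)."""
    n = A.shape[0]
    factors = []
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        q, r = np.linalg.qr(A[start:stop, start:stop])
        factors.append((start, stop, q, r))
    return factors

def apply_block_qr(factors, v):
    """Approximate A^{-1} v by solving each diagonal block via its QR factors."""
    z = np.empty_like(v, dtype=float)
    for start, stop, q, r in factors:
        # Block solve: (Q R) z_blk = v_blk  =>  R z_blk = Q^T v_blk
        z[start:stop] = np.linalg.solve(r, q.T @ v[start:stop])
    return z

# Hypothetical small, well-conditioned test matrix.
rng = np.random.default_rng(0)
A = 4.0 * np.eye(6) + 0.1 * rng.standard_normal((6, 6))
v = rng.standard_normal(6)
z_full = apply_block_qr(build_block_qr(A, 6), v)    # one block: exact solve
z_small = apply_block_qr(build_block_qr(A, 2), v)   # three 2x2 blocks: approximation
```

With a single block covering the whole matrix the preconditioner is an exact solve, which is accurate but as expensive as the original problem. Smaller blocks are cheaper and cruder.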

The central parameter is block size. Smaller blocks may be cheaper to process but weaker as preconditioners. Larger blocks may capture more structure but cost more per application. The right value is not universal. It depends on matrix size, sparsity, density, conditioning, and structure. In portfolio problems, that structure may reflect covariance patterns, constraints, or asset-class groupings. In option-pricing systems, it may reflect discretisation and boundary effects.

The paper’s contribution is to treat block-size selection as a control problem rather than a manually tuned constant.

The PPO agent controls the preconditioner, not the portfolio

The mechanism is compact:

Component	In the paper	Operational interpretation
Solver	FGMRES	Iteratively solves large $Ax=b$ systems
Preconditioner	Block-partitioned QR-based preconditioner	Reorganises the numerical problem to improve convergence
RL algorithm	PPO	Learns a policy for choosing block size
State	Current residual vector	What the solver is still failing to explain
Action	Integer block-size adjustment	How coarsely or finely the matrix is partitioned
Reward	Negative residual norm	Incentive to reduce numerical error
Baseline comparison	Constant block-size preconditioning	Manual/static tuning alternative

The agent observes residual information during solving. It chooses a block size. The solver rebuilds or applies the corresponding preconditioner, performs Arnoldi iterations, solves the least-squares subproblem inside GMRES, updates the approximate solution, and computes a new residual. Over many episodes, PPO updates its policy so that block-size choices become more useful for residual reduction.
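The loop described above can be sketched in a few dozen lines. This is an illustrative single-cycle FGMRES in NumPy, not the authors' implementation: the diagonal-block QR preconditioner is an assumed construction, and `toy_policy` is a hand-written placeholder standing in for a trained PPO policy. The reward recorded at each step is the negative residual norm, matching the table.

```python
import numpy as np

def block_qr_apply(A, v, block_size):
    """Apply an assumed diagonal-block QR preconditioner to v."""
    z = np.empty_like(v, dtype=float)
    for s in range(0, len(v), block_size):
        e = min(s + block_size, len(v))
        q, r = np.linalg.qr(A[s:e, s:e])
        z[s:e] = np.linalg.solve(r, q.T @ v[s:e])
    return z

def fgmres_adaptive(A, b, policy, max_iter=30, tol=1e-10):
    """One FGMRES cycle from x0 = 0; `policy` picks a block size each step."""
    n = len(b)
    x = np.zeros(n)
    residual = b.copy()
    beta = np.linalg.norm(residual)
    V = np.zeros((n, max_iter + 1))    # Arnoldi basis
    Z = np.zeros((n, max_iter))        # preconditioned vectors, kept because M varies
    H = np.zeros((max_iter + 1, max_iter))
    V[:, 0] = residual / beta
    history = []                       # (block_size, reward) per step
    for j in range(max_iter):
        block_size = policy(residual)                  # action from current state
        Z[:, j] = block_qr_apply(A, V[:, j], block_size)
        w = A @ Z[:, j]
        for i in range(j + 1):                         # modified Gram-Schmidt
            H[i, j] = w @ V[:, i]
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        rhs = np.zeros(j + 2)
        rhs[0] = beta
        y, *_ = np.linalg.lstsq(H[:j + 2, :j + 1], rhs, rcond=None)
        x = Z[:, :j + 1] @ y                           # update approximate solution
        residual = b - A @ x                           # next state
        history.append((block_size, -np.linalg.norm(residual)))  # reward
        if np.linalg.norm(residual) <= tol * beta or H[j + 1, j] < 1e-14:
            break
        V[:, j + 1] = w / H[j + 1, j]
    return x, history

def toy_policy(residual):
    """Placeholder rule, NOT a trained PPO agent: coarser blocks once the residual is small."""
    return 2 if np.linalg.norm(residual) > 1e-3 else 8

# Hypothetical non-symmetric, well-conditioned test system.
rng = np.random.default_rng(0)
A = 4.0 * np.eye(40) + 0.05 * rng.standard_normal((40, 40))
b = rng.standard_normal(40)
x, history = fgmres_adaptive(A, b, toy_policy)
```

Recomputing the true residual at every inner step is wasteful, and a production solver would track the residual norm through the least-squares problem instead. It is done here because the residual vector is the state the agent is said to observe.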

That mechanism is the point. The reinforcement learning is not forecasting returns. It is not deciding whether to buy Nvidia, short volatility, or rotate into cash because it saw a chart pattern on social media. It is adapting an internal numerical parameter in response to solver behaviour.

This is also why the interpretation should stay disciplined. The paper demonstrates faster convergence of a solver under a particular adaptive preconditioning scheme. It does not demonstrate investment alpha, pricing superiority, risk reduction in portfolios, or better economic decisions. Those benefits could follow only if the solver acceleration transfers into a larger production process where numerical solve time is a binding constraint.

What the experiments are actually doing

The paper’s experimental section is short, but its structure is clear. The figures are not separate theories; they are convergence comparisons under different matrix settings.

Test or figure group	Likely purpose	What it supports	What it does not prove
Portfolio matrices, size 4,008 with 8,188 non-zeros	Main evidence on a smaller real-world portfolio matrix	PPO-adaptive block sizing can beat constant block-size preconditioning in residual convergence	Production runtime savings, economic portfolio improvement, or dominance over all preconditioners
Portfolio matrices, size 16,955 with 37,849 non-zeros	Main evidence at a larger portfolio scale	The convergence advantage is not confined to the smallest case	Robustness across all covariance structures or constraints
Portfolio matrix, size 33,833 with 73,249 non-zeros	Main evidence at the largest reported portfolio scale	The method still improves convergence on a much larger non-symmetric matrix	Scalability to million-variable institutional systems
Synthetic option-pricing matrices, size 1,000, densities 0.01 and 0.05	Sensitivity-style test across density	PPO reaches low residuals in fewer iterations than constant block size	Accuracy of the option-pricing model itself
Synthetic option-pricing matrices, size 2,000, densities 0.01 and 0.05	Sensitivity-style test across larger pricing systems	The speedup pattern remains visible at a larger synthetic size	Generality to complex products, stochastic volatility, jumps, American exercise, or full pricing libraries

The portfolio matrices are real-world portfolio optimisation matrices drawn from a sparse matrix collection. The paper reports three matrix scales: 4,008 with 8,188 non-zero elements; 16,955 with 37,849 non-zero elements; and 33,833 with 73,249 non-zero elements. The authors state that these matrices are non-symmetric, which matters because GMRES-family methods are designed for non-symmetric systems, where solvers that rely on symmetry, such as conjugate gradient, do not apply directly.

The option-pricing tests are synthetic. The paper uses matrices of size 1,000 and 2,000 at densities 0.01 and 0.05, with volatilities ranging from 5% to 25% and a 1% risk-free rate. These are not market backtests. They are numerical solver experiments on systems generated from pricing discretisations.

That is a narrower claim, but not a weak one. Numerical infrastructure does not need glamour. It needs to stop wasting cycles.

Faster residual convergence is useful, but it is not the same as faster business value

The result pattern is consistent: PPO-guided adaptive block sizing reduces residuals faster than a constant block-size method across the reported portfolio and option-pricing cases.

In the option-pricing figures, the contrast is visually direct. For matrix size 1,000, the PPO-based solver reaches low residual levels after two to three iterations depending on density, while the constant block-size method continues through five iterations. For matrix size 2,000, the same broad pattern appears: the adaptive method drops the residual faster, and the constant method takes more iterations to reach comparable low-residual territory. The authors state that in some cases the RL agent reduces the number of iterations required for option-pricing systems to as few as two.

Iteration reduction is valuable only if it survives the accounting department. A method that needs fewer iterations can still lose on total time if each of those iterations costs more. The paper acknowledges that RL training cost is nontrivial, but it does not provide a full wall-clock cost model, training budget, inference overhead, memory profile, or comparison against a broader suite of preconditioners such as ILU, multigrid, or domain decomposition.

So the result should be read as a convergence result first. It indicates that adaptive block-size control can improve solver behaviour. It does not yet quantify total cost of ownership.

This distinction is not pedantry. In production, an adaptive preconditioner has several cost layers:

training the PPO policy;
storing and serving the policy;
rebuilding or changing preconditioners during solve time;
computing residual-derived state;
maintaining numerical stability and reproducibility;
integrating the method into existing pricing, risk, or optimisation engines.

If the same family of matrices appears repeatedly, those costs can be amortised. That is the interesting case. A market-maker repricing related products, a risk engine running daily or intraday scenarios, or a portfolio platform rebalancing across similar universes may produce matrix families with recurring structure. In that setting, learning a policy once and reusing it many times begins to make sense.
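The amortisation argument reduces to simple arithmetic: a one-off training cost is recovered only after enough solves, each saving a little time. The sketch below uses entirely hypothetical numbers (the paper reports no wall-clock figures), and `break_even_solves` is a name invented for the example.

```python
import math

def break_even_solves(training_cost_s, baseline_solve_s, adaptive_solve_s):
    """Number of solves needed to recover a one-off training cost.

    Returns None when the adaptive solver is not faster per solve,
    because the training cost is then never recovered.
    """
    saving = baseline_solve_s - adaptive_solve_s
    if saving <= 0:
        return None
    return math.ceil(training_cost_s / saving)

# Hypothetical: one hour of training, 0.50 s baseline solve, 0.25 s adaptive solve.
n_solves = break_even_solves(training_cost_s=3600.0,
                             baseline_solve_s=0.50,
                             adaptive_solve_s=0.25)

# Hypothetical losing case: the adaptive solve is slower in wall-clock terms.
never = break_even_solves(training_cost_s=100.0,
                          baseline_solve_s=1.0,
                          adaptive_solve_s=1.2)
```

Under these made-up numbers the policy pays for itself after 14,400 solves. That is a busy afternoon for an intraday risk engine and a very long time for a quarterly rebalance.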

If every matrix is a one-off alien artefact from a different mathematical planet, training an RL controller may be overkill. A tuned conventional method, or even a boring heuristic, may win. Boring heuristics win more often than conference abstracts prefer to admit.

The business value is latency compression in repeated numerical workloads

The plausible business pathway starts with repeated solves.

Portfolio construction systems often need to update weights as expected returns, covariance estimates, constraints, or risk budgets change. Option-pricing systems may solve discretised equations repeatedly across strikes, maturities, scenarios, and calibration loops. Risk engines may run thousands of related valuations under stress assumptions. In all of these cases, shaving solver iterations can matter if the linear system solve is a meaningful share of runtime.

The paper’s method is best understood as adaptive infrastructure for numerical finance:

Technical contribution	Operational consequence	ROI relevance
PPO chooses block-preconditioner size dynamically	Less manual tuning of solver parameters	Lower engineering time if matrix families recur
FGMRES supports changing preconditioners	Adaptive choices can be made during iteration	Better fit for non-stationary residual behaviour
Residual norm drives the reward	Optimisation target matches solver convergence	Cleaner objective than vague “finance performance”
Tests show fewer iterations than constant block size	Potential latency reduction	Valuable only if per-iteration and training overhead do not erase the gain
Works on reported non-symmetric portfolio matrices	Relevant to awkward real-world numerical systems	Still requires validation on each institution’s own matrices

The strongest practical use case is not a flashy front-office AI assistant. It is a back-end numerical acceleration layer.

A bank, hedge fund, exchange, or analytics vendor could evaluate this kind of method by taking archived matrix workloads from existing systems and replaying them through several solver configurations. The benchmark should not stop at residual plots. It should include wall-clock time, GPU or CPU utilisation, memory pressure, failure rates, stability under changing market conditions, and comparison against well-tuned classical baselines.
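A replay benchmark of that kind does not need much scaffolding. The sketch below is one possible shape for it, with hypothetical names throughout: `replay_benchmark` runs each archived `(A, b)` pair through each solver configuration and records median wall-clock time and failure counts. The two "solvers" shown are stand-ins; a real comparison would plug in the adaptive method and the tuned classical baselines, and would add memory and utilisation measurements.

```python
import time
import numpy as np

def replay_benchmark(workloads, solvers, tol=1e-8):
    """Replay archived (A, b) pairs through each solver configuration."""
    report = {}
    for name, solve in solvers.items():
        times, failures = [], 0
        for A, b in workloads:
            start = time.perf_counter()
            try:
                x = solve(A, b)
            except np.linalg.LinAlgError:
                failures += 1
                continue
            times.append(time.perf_counter() - start)
            rel_res = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
            if not rel_res <= tol:          # also counts NaN residuals as failures
                failures += 1
        report[name] = {
            "median_s": float(np.median(times)) if times else float("nan"),
            "failures": failures,
        }
    return report

# Hypothetical workload: five small non-symmetric systems.
rng = np.random.default_rng(1)
workloads = [(4.0 * np.eye(50) + 0.05 * rng.standard_normal((50, 50)),
              rng.standard_normal(50)) for _ in range(5)]
solvers = {
    "direct": np.linalg.solve,
    "zero_guess": lambda A, b: np.zeros_like(b),  # deliberately bad, to show failure counting
}
report = replay_benchmark(workloads, solvers)
```

Counting a solve as failed when it misses the residual tolerance, not only when it raises, matters here: a fast solver that returns the wrong answer should not top the leaderboard.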

If the adaptive policy generalises across matrix families, it could reduce latency for repeated workloads. If it fails to generalise, it becomes another machine-learning component that performs beautifully in a figure and sulks in production. We have met the species.

The paper’s evidence is promising but incomplete by design

The paper is published as a workshop paper, and it reads like an early mechanism demonstration rather than a complete deployment study. That is not a flaw by itself. But it sets the boundary for interpretation.

The main limitation is that the paper reports convergence behaviour, not full runtime economics. Faster residual reduction usually helps, but total runtime depends on the cost of each adaptive step. Rebuilding block preconditioners with QR decomposition has its own cost. PPO inference is probably small relative to large matrix operations, but “probably” is not a benchmark.

The second limitation is baseline coverage. The constant block-size comparison is sensible because it isolates the adaptive block-size choice. But production numerical teams will ask a broader question: how does this compare with mature preconditioning strategies, tuned restart parameters, ILU variants, multigrid methods, domain decomposition, or problem-specific solvers? The paper’s answer is not yet complete.

The third limitation is generality. The portfolio tests use real-world sparse matrices, which strengthens the evidence. The option-pricing tests are synthetic and relatively controlled. The method may behave differently under more complex derivatives settings: multi-asset grids, stochastic volatility, jumps, early exercise features, penalty methods, or hybrid PDE-Monte Carlo pipelines.

The fourth limitation is reproducibility detail. The paper gives the algorithmic outline and describes the PPO setup at a high level, but it does not provide enough detail in the main text to fully reconstruct architecture choices, training episode counts, hyperparameters, or cost budgets. For a research note, that is survivable. For production adoption, it is the beginning of the due diligence list.

The useful lesson is adaptive numerical plumbing

The fashionable mistake would be to file this under “RL for finance” and imagine a reinforcement learning agent learning markets. The more accurate and more interesting reading is “RL for numerical control”.

That category matters. Many enterprise AI gains will not come from replacing experts with chatbots or letting agents roam through strategic decisions while everyone claps nervously. They will come from adaptive systems tuning the dull but expensive machinery that already runs modern organisations: solvers, schedulers, optimisers, simulators, compilers, databases, and pipelines.

This paper sits neatly in that world. It takes a parameter that would otherwise be fixed or manually tuned, observes the system’s live numerical state, and adapts the parameter to reduce residual faster. That is a practical pattern. It is also less theatrical than most AI-finance narratives, which is usually a good sign.

For financial institutions, the next question is empirical and local. Take your matrix families. Measure where solve time actually hurts. Compare adaptive PPO-guided preconditioning against your best conventional baselines. Include training cost. Include runtime overhead. Include failure cases. Then decide whether RL belongs in the solver loop.

The paper gives a plausible mechanism and encouraging convergence evidence. It does not give a free lunch. It gives a better way to ask whether lunch preparation can be automated.

Cognaptus: Automate the Present, Incubate the Future.

¹ Hadi Keramati and Samaneh Jazayeri, “Accelerated Portfolio Optimization and Option Pricing with Reinforcement Learning,” arXiv:2507.01972, 2025. https://arxiv.org/abs/2507.01972
