Pareto on Autopilot: Evolving RL Policies for Messy Supply Chains

A supply chain rarely fails because one objective was neglected in a spreadsheet. It fails because the spreadsheet quietly pretended the objective would stay still.

Yesterday the priority was margin. Today it is carbon exposure. Tomorrow a route becomes expensive, a supplier becomes unreliable, demand arrives in a pattern that looks suspiciously like a sine wave wearing a hard hat, and the “optimal” plan starts ageing like milk.

That is the operational problem behind MORSE, a proposed framework from Niki Kotecha and Ehecatl Antonio del Rio Chanona: Multi-Objective Reinforcement Learning via Strategy Evolution for Supply Chain Optimization.¹ The important part is not just that the method uses reinforcement learning. That, by itself, is no longer enough to make anyone spill their coffee. The useful shift is that MORSE does not train one heroic policy and ask it to be wise forever. It evolves a Pareto front of neural policies: a portfolio of behaviours, each making a different trade-off among profit, emissions, and lead time.

That distinction matters. A Pareto front of solutions is a menu of static answers. A Pareto front of policies is closer to an operating playbook.

The real product is not an optimum; it is a policy switchboard

The common shortcut in multi-objective optimisation is scalarisation: collapse several objectives into one weighted score, train or optimise against that score, then call the resulting compromise “optimal”. This is tidy. It is also brittle. Weights chosen in planning meetings have a mysterious habit of becoming wrong during disruptions.
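A toy illustration of that brittleness, using made-up numbers rather than anything from the paper: three hypothetical policies are scored on profit, emissions, and lead time, then ranked by a weighted sum. The policy names, the outcome values, and both weight vectors are assumptions chosen only to show how the "optimal" answer flips when the weights move.

```python
# Hypothetical candidates: (profit, emissions, lead_time). Illustrative numbers only.
candidates = {
    "margin_first": (100.0, 80.0, 5.0),
    "balanced": (85.0, 50.0, 6.0),
    "low_carbon": (70.0, 20.0, 9.0),
}


def scalar_score(outcome, weights):
    """Weighted sum: profit is rewarded, emissions and lead time are penalised."""
    profit, emissions, lead_time = outcome
    w_profit, w_emissions, w_lead = weights
    return w_profit * profit - w_emissions * emissions - w_lead * lead_time


def best_under(weights):
    """Name of the candidate with the highest scalarised score."""
    return max(candidates, key=lambda name: scalar_score(candidates[name], weights))


planning_weights = (1.0, 0.2, 1.0)  # assumed weights from a calm planning meeting
post_tax_weights = (1.0, 1.0, 1.0)  # assumed weights once emissions start to cost money

print(best_under(planning_weights))
print(best_under(post_tax_weights))
```

The same three candidates produce a different winner under each weight vector, which is the whole problem: a single trained compromise is only as durable as the weights behind it.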

MORSE avoids that by keeping the objectives separate during search. The framework maintains a population of neural-network policies. Each policy is evaluated inside a simulated multi-objective inventory environment. Instead of asking which policy has the best single score, MORSE uses non-dominated sorting: a policy survives if no other policy is at least as good on every objective and strictly better on at least one. Crowding distance then encourages diversity, so the final front is not thirty copies of the same compromise wearing different shoes.
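A minimal sketch of Pareto dominance and front peeling, assuming every objective has been rewritten so that larger is better (emissions and lead time are negated). This is the naive quadratic version, not the bookkeeping-heavy fast sort from NSGA-II, and the score tuples are invented for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def sort_into_fronts(points):
    """Peel off successive non-dominated fronts until no points remain."""
    remaining = list(points)
    fronts = []
    while remaining:
        front = [p for p in remaining if not any(dominates(q, p) for q in remaining)]
        fronts.append(front)
        remaining = [p for p in remaining if p not in front]
    return fronts


# Hypothetical scores as (profit, -emissions, -lead_time), so bigger is always better.
scores = [
    (100.0, -80.0, -5.0),
    (85.0, -50.0, -6.0),
    (70.0, -20.0, -9.0),
    (80.0, -60.0, -7.0),  # worse than the second policy on all three objectives
]

fronts = sort_into_fronts(scores)
print(fronts[0])
print(fronts[1])
```

The first three policies each win somewhere, so none can be discarded without a value judgement. The fourth loses everywhere to the second and drops to a later front.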

The mechanism is simple enough to be useful:

Mechanism	What it does technically	Operational translation
Population of neural policies	Maintains many candidate policies, each with different network parameters	Keeps a catalogue of operating behaviours rather than one locked-in plan
Multi-objective evaluation	Scores each policy on profit, emissions, and lead time	Preserves the business trade-off instead of hiding it inside a weighted average
Non-dominated sorting	Ranks policies by Pareto dominance and keeps those that no other policy dominates	Removes obviously inferior strategies without pretending one KPI decides everything
Crowding distance	Rewards spread across the Pareto front	Prevents the policy catalogue from collapsing into one narrow region of the trade-off space
Mutation and crossover	Generates new policies from selected parents	Searches globally through policy space rather than nudging one policy by gradient updates
CVaR-based evaluation	Scores policies by tail outcomes, not just mean returns	Makes worst-case behaviour visible before operations discover it the expensive way
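The crowding-distance row is the least intuitive of the six, so here is a sketch of the standard NSGA-II calculation. It assumes MORSE uses the textbook form, which the article does not confirm, and the four score tuples are hypothetical. Boundary points on each objective get infinite distance so the extremes of the front are never crowded out.

```python
import math


def crowding_distance(front):
    """Per-point crowding distance for a list of objective tuples (textbook NSGA-II form)."""
    n = len(front)
    if n == 0:
        return []
    distance = [0.0] * n
    for m in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][m])
        lowest, highest = front[order[0]][m], front[order[-1]][m]
        # Extremes on any objective are always kept.
        distance[order[0]] = distance[order[-1]] = math.inf
        if highest == lowest:
            continue  # no spread on this objective, nothing to add
        for rank in range(1, n - 1):
            gap = front[order[rank + 1]][m] - front[order[rank - 1]][m]
            distance[order[rank]] += gap / (highest - lowest)
    return distance


# Hypothetical front as (profit, -emissions, -lead_time); the middle two are near-twins.
front = [
    (100.0, -80.0, -5.0),
    (85.0, -50.0, -6.0),
    (84.0, -49.0, -6.5),
    (70.0, -20.0, -9.0),
]

for point, dist in zip(front, crowding_distance(front)):
    print(point, dist)
```

When a front has more members than the population can hold, the policies with the smallest distance go first. That is the mechanical meaning of "prevents the catalogue from collapsing".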

This is why the paper is best read mechanism-first. The experiments matter, but they support the deeper move: MORSE changes the unit of decision from “the plan” to “the set of switchable policies”.

The agent controls quantity and transport mode, not just a toy action

The inventory-control agent in MORSE operates in a multi-echelon, multi-product supply chain model. The objectives are cumulative profit, transportation emissions, and lead time. The environment includes stochastic lead time and stochastic customer demand. Demand is tested under both ordinary Poisson arrivals and a non-stationary seasonal Poisson process.
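To make the two demand regimes concrete, here is a sketch of a stationary and a seasonal Poisson demand generator. The sinusoidal rate function, its base level, amplitude, and period are assumptions for illustration; the article does not give the paper's exact parameterisation. The sampler is Knuth's classic method, used here only to stay inside the standard library.

```python
import math
import random


def poisson_sample(rate, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def seasonal_rate(t, base=10.0, amplitude=5.0, period=52):
    """Assumed seasonal intensity: a sine wave around a base demand rate."""
    return base + amplitude * math.sin(2.0 * math.pi * t / period)


def simulate_demand(horizon, seasonal, seed=0, base=10.0):
    """One demand trajectory: constant-rate Poisson, or Poisson with a time-varying rate."""
    rng = random.Random(seed)
    return [
        poisson_sample(seasonal_rate(t, base=base) if seasonal else base, rng)
        for t in range(horizon)
    ]


flat = simulate_demand(104, seasonal=False)
wavy = simulate_demand(104, seasonal=True)
print(flat[:10])
print(wavy[:10])
```

The non-stationary case is the harder one for a policy: the same inventory position means different things at the peak and the trough of the cycle.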

The agent’s state includes on-hand inventory, pipeline inventory, backlog, and fixed windows of past demand and order history. That last detail is not glamorous, but it is practical. Supply chains are partially observable; pretending the current inventory snapshot contains the entire truth is a useful way to manufacture avoidable surprises. MORSE uses a fixed history window instead of a recurrent neural network, favouring implementation simplicity over architectural elegance. Sensible. Not everything needs an LSTM just because someone found one in the drawer.
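A fixed history window is easy to sketch. The class below is a hypothetical single-node, single-product simplification (the paper's environment is multi-echelon and multi-product), and the class name, field order, and window length are all assumptions.

```python
from collections import deque


class HistoryWindowState:
    """Builds a flat observation from the current snapshot plus fixed-length histories."""

    def __init__(self, window):
        # Pre-fill with zeros so the observation has a constant length from step 0.
        self.demand = deque([0.0] * window, maxlen=window)
        self.orders = deque([0.0] * window, maxlen=window)

    def record(self, demand, order):
        # A deque with maxlen drops its oldest entry once the window is full.
        self.demand.append(float(demand))
        self.orders.append(float(order))

    def observation(self, on_hand, pipeline, backlog):
        """Snapshot first, then past demand, then past orders (oldest to newest)."""
        return [float(on_hand), float(pipeline), float(backlog), *self.demand, *self.orders]


state = HistoryWindowState(window=3)
for demand, order in [(4, 5), (6, 5), (9, 8), (7, 10)]:
    state.record(demand, order)

obs = state.observation(on_hand=12, pipeline=8, backlog=0)
print(obs)
```

The observation length never changes, so a plain feed-forward network can consume it. The cost is that anything older than the window is invisible to the policy.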

The policy outputs two kinds of decisions:

Order replenishment, modelled as a continuous action sampled from a Gaussian distribution and then scaled into a feasible replenishment quantity.
Transportation mode, modelled as a discrete categorical choice, such as truck, rail, or air.

This combination matters because real inventory control is not merely “how much should we order?” It is also “how should we move it?” The policy therefore touches both volume and logistics mode, giving it room to trade margin, emissions, and speed.
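The two-part action can be sketched as one sampling function. The article says the Gaussian sample is scaled into a feasible quantity but not how, so the sigmoid squashing, the `max_order` cap, and the softmax over mode logits below are assumptions; in a real policy the mean, log standard deviation, and logits would come from the network's output heads.

```python
import math
import random

MODES = ("truck", "rail", "air")


def softmax(logits):
    """Numerically stable softmax over the transport-mode logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(mean, log_std, mode_logits, max_order, rng):
    """Sample a continuous order quantity and a discrete transport mode."""
    raw = rng.gauss(mean, math.exp(log_std))
    # Assumed scaling: sigmoid (written via tanh for stability) maps to [0, max_order].
    squashed = 0.5 * (1.0 + math.tanh(0.5 * raw))
    quantity = squashed * max_order
    mode = rng.choices(MODES, weights=softmax(mode_logits), k=1)[0]
    return quantity, mode


rng = random.Random(0)
quantity, mode = sample_action(
    mean=0.0, log_std=-1.0, mode_logits=[0.2, 0.1, -1.0], max_order=50.0, rng=rng
)
print(round(quantity, 2), mode)
```

Keeping the two heads separate is what lets one policy order a lot by rail while another orders little by air, which is exactly the kind of spread a Pareto front needs.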

Evolutionary RL is doing the search that scalar weights usually dodge

MORSE uses a multi-objective evolutionary algorithm in the spirit of NSGA-II to search over neural policy parameters. Each policy is rolled out across episodes. Its vector of objective returns is recorded. Policies are ranked by Pareto dominance. Selected policies reproduce through crossover and mutation. The next generation keeps the best-ranked and most diverse candidates.
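The loop described above can be sketched in a few dozen lines. This is a simplified NSGA-II-style generation step, not the paper's implementation: parents are drawn uniformly instead of by tournament, crossover is uniform, mutation is Gaussian with an assumed `sigma`, and `toy_evaluate` is a hypothetical stand-in for rolling a neural policy through simulated episodes.

```python
import math
import random


def dominates(a, b):
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def fronts_of(scores):
    """Peel successive non-dominated fronts; returns lists of indices into scores."""
    remaining = set(range(len(scores)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(scores[j], scores[i]) for j in remaining)]
        fronts.append(front)
        remaining -= set(front)
    return fronts


def crowding(front, scores):
    """Crowding distance per index; boundary points get infinity."""
    dist = {i: 0.0 for i in front}
    for m in range(len(scores[front[0]])):
        order = sorted(front, key=lambda i: scores[i][m])
        span = scores[order[-1]][m] - scores[order[0]][m]
        dist[order[0]] = dist[order[-1]] = math.inf
        if span == 0:
            continue
        for k in range(1, len(order) - 1):
            dist[order[k]] += (scores[order[k + 1]][m] - scores[order[k - 1]][m]) / span
    return dist


def make_child(mum, dad, rng, sigma=0.1):
    """Uniform crossover followed by Gaussian mutation on every parameter."""
    return [rng.choice(pair) + rng.gauss(0.0, sigma) for pair in zip(mum, dad)]


def next_generation(population, evaluate, rng):
    """Breed children, pool them with parents, keep the best-ranked and most spread."""
    size = len(population)
    children = [make_child(*rng.sample(population, 2), rng) for _ in range(size)]
    pool = population + children
    scores = [evaluate(p) for p in pool]
    survivors = []
    for front in fronts_of(scores):
        if len(survivors) + len(front) > size:
            dist = crowding(front, scores)
            front = sorted(front, key=dist.get, reverse=True)[: size - len(survivors)]
        survivors.extend(front)
        if len(survivors) == size:
            break
    return [pool[i] for i in survivors]


def toy_evaluate(params):
    """Stand-in for episode rollouts: three conflicting objectives, all maximised."""
    x, y = params
    return (-(x ** 2 + y ** 2), -((x - 1) ** 2 + y ** 2), -(x ** 2 + (y - 1) ** 2))


rng = random.Random(0)
population = [[rng.uniform(-2, 2), rng.uniform(-2, 2)] for _ in range(20)]
for _ in range(30):
    population = next_generation(population, toy_evaluate, rng)
print(len(population), population[0])
```

Note what is absent: no gradients and no objective weights. The only thing the search needs from a policy is its vector of returns, which is why the same loop works for any simulator that can score a rollout.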

That design has a practical interpretation. Instead of rerunning a dynamic multi-objective optimiser whenever the world changes, the organisation trains a policy catalogue in advance. During operations, the decision maker can select a policy that fits the current regime.

This is not the same as saying the policy catalogue makes decisions “automatically” in the governance sense. The paper’s adaptive examples assume a switch to a more favourable policy when conditions change. In production, the switch rule would need to be explicit: who authorises it, what KPI thresholds trigger it, how long a policy must remain active before another switch, and whether a warehouse can absorb the operational whiplash. The algorithm gives the switchboard. It does not absolve management of owning the switches. Nice try, management.
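An explicit switch rule could look something like the sketch below. Everything in it is hypothetical: the paper does not specify a governance layer, and the class names, the single emissions-cost trigger, the dwell time, and the approver field are assumptions standing in for whatever a real operation would require.

```python
from dataclasses import dataclass


@dataclass
class SwitchRule:
    min_dwell_steps: int            # how long a policy must stay active after a switch
    emissions_cost_trigger: float   # hypothetical KPI threshold that justifies a switch
    requires_approval: bool = True  # whether a named human must sign off


class PolicySwitchboard:
    """Applies the rule to switch proposals and keeps an audit log of every decision."""

    def __init__(self, rule, active):
        self.rule = rule
        self.active = active
        self.last_switch_step = None
        self.log = []

    def propose(self, step, emissions_cost, candidate, approved_by=None):
        """Return the active policy name after considering the proposed switch."""
        if candidate == self.active:
            return self.active
        reason = self._blocking_reason(step, emissions_cost, approved_by)
        self.log.append((step, self.active, candidate, reason or "switched"))
        if reason is None:
            self.active, self.last_switch_step = candidate, step
        return self.active

    def _blocking_reason(self, step, emissions_cost, approved_by):
        if emissions_cost < self.rule.emissions_cost_trigger:
            return "trigger not met"
        dwell_started = self.last_switch_step
        if dwell_started is not None and step - dwell_started < self.rule.min_dwell_steps:
            return "dwell time not elapsed"
        if self.rule.requires_approval and approved_by is None:
            return "no approver"
        return None


board = PolicySwitchboard(SwitchRule(min_dwell_steps=20, emissions_cost_trigger=0.5),
                          active="margin_first")
print(board.propose(199, 0.0, "low_carbon", approved_by="ops_lead"))
print(board.propose(200, 1.0, "low_carbon"))
print(board.propose(200, 1.0, "low_carbon", approved_by="ops_lead"))
print(board.propose(205, 1.0, "margin_first", approved_by="ops_lead"))
```

The audit log is the part worth keeping even if everything else changes. A rejected switch with a recorded reason is far more useful in a post-mortem than a switch that silently never happened.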

The disruption tests show adaptive trade-offs, not magic domination

The paper evaluates MORSE across three inventory configurations: a three-node, two-product network with seasonal demand; a similar three-node setup with Poisson demand; and a five-node, two-product network with Poisson demand. The adaptive behaviour analysis then introduces two disruption families.

The first is an emission-penalty scenario, where an emission tax is introduced at time step 200 and the system switches to a more suitable policy from the Pareto front. The second is a geopolitical-tension scenario, represented by a 10% cost increase over a period, again motivating a switch to a different policy.
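Both shocks can be written as time-dependent modifiers on a per-step profit calculation. In the sketch below only the step-200 tax onset and the 10% cost increase come from the article; the tax rate, the timing of the cost window, the 300-step horizon, and the two policy profiles are invented to make the arithmetic visible.

```python
TAX_START = 200  # from the article: the emission tax is introduced at time step 200


def emission_tax(step, rate=0.5):
    """Tax per unit of emissions; the rate is a hypothetical value."""
    return rate if step >= TAX_START else 0.0


def cost_multiplier(step, start=100, end=150):
    """10% cost increase (from the article) over an assumed window of steps."""
    return 1.10 if start <= step < end else 1.0


# Hypothetical per-step averages for two policies taken from a Pareto front.
PROFILES = {
    "margin_first": {"revenue": 20.0, "cost": 10.0, "emissions": 8.0},
    "low_carbon": {"revenue": 18.0, "cost": 10.0, "emissions": 2.0},
}


def step_profit(name, step):
    profile = PROFILES[name]
    cost = profile["cost"] * cost_multiplier(step)
    return profile["revenue"] - cost - emission_tax(step) * profile["emissions"]


def cumulative_profit(schedule, horizon=300):
    """schedule maps a step to the name of the policy active at that step."""
    return sum(step_profit(schedule(t), t) for t in range(horizon))


def never_switch(step):
    return "margin_first"


def switch_at_tax(step):
    return "margin_first" if step < TAX_START else "low_carbon"


print(cumulative_profit(never_switch))
print(cumulative_profit(switch_at_tax))
```

In this toy setup the switch recovers some profit after the tax, but only because the low-carbon profile was already sitting in the catalogue. The comparison says nothing about lead time, which is where the real bill tends to arrive.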

The important reading discipline is to classify the evidence correctly.

Paper component	Likely purpose	What it supports	What it does not prove
Emission-penalty tests across three configurations	Main adaptive-behaviour evidence	Policy switching can change the profit–emissions–lead-time trade-off after a regulatory-style shock	That the same switch logic will work under real carbon markets or real compliance constraints
Geopolitical-cost tests	Main adaptive-behaviour evidence / scenario extension	The Pareto policy set can be used to respond to cost shocks without retraining from scratch	That “geopolitics” has been fully modelled; the test is a stylised cost surge
CVaR-trained reward distributions	Robustness and risk-sensitivity test	Tail-aware policy evaluation can improve worst-case outcomes relative to mean-trained policies	That all operational tail risks are captured, or that a single sampled policy proves catalogue-wide superiority
CAPQL and MONES benchmark	Comparison with prior MORL work	MORSE is competitive, and the paper reports superior performance in the inventory case	Universal dominance across all domains, simulators, or action spaces

In the emission-penalty figures, the switch is not a free lunch. In the seasonal three-node configuration, switching protects profit and reduces emissions after the penalty, but at the cost of higher lead time. That is exactly what a Pareto system should reveal. If the result showed every KPI improving with no trade-off, the correct response would be suspicion, not applause.

In the Poisson and five-node configurations, the pattern varies, but the same lesson holds: the value is not “MORSE always wins every metric.” The value is that the operator has a structured way to move along the trade-off surface when the world changes. The policy is not optimal in the abstract. It is appropriate to a regime.

The geopolitical scenario makes the same point from a cost-shock angle. The configuration the paper labels B explicitly shows higher profitability at the expense of higher emissions. Again, this is useful because it is uncomfortable. It says the method can expose the price of resilience, not pretend resilience is a scented candle.

CVaR turns the policy catalogue from average-smart to tail-aware

Mean performance is a dangerous comfort object. A supply-chain policy can look good on average and still produce ugly outcomes in the lower tail: severe delays, emission spikes, or profit collapses under unlucky demand and lead-time combinations.

MORSE adds a risk-aware version using Conditional Value-at-Risk, or CVaR. In the modified framework, policies are evaluated using empirical return distributions from sampled episodes. CVaR is computed for each objective and used as the policy’s fitness vector inside the evolutionary process, rather than being added as a fourth objective.

That choice is subtle and important. CVaR is not treated as another KPI beside profit, emissions, and lead time. It changes how those KPIs are evaluated. The policy is no longer judged merely by typical outcomes; it is judged by what happens in the bad tail.

For profit, the bad tail means low returns, so better CVaR means improving the worst profit outcomes. For emissions and lead time, the bad tail means high emissions or long delays, so better CVaR means reducing those extremes. In the paper’s robustness analysis, the authors compare a randomly selected CVaR-trained policy against a mean-trained policy using 1,000 Monte Carlo simulations. The reported result is directionally clean: the CVaR-trained policy improves the relevant tail metric on each objective.
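The direction-dependent tail is the part that is easy to get backwards, so here is a minimal empirical CVaR sketch. The tail level `alpha`, the function names, and the sample returns are assumptions; the article does not state the level the paper uses.

```python
import math


def cvar(samples, alpha=0.2, higher_is_better=True):
    """Mean of the worst alpha-fraction of sampled outcomes for one objective."""
    tail_size = max(1, math.ceil(alpha * len(samples)))
    # Worst outcomes are the lowest values for profit, the highest for emissions or delay.
    ordered = sorted(samples, reverse=not higher_is_better)
    tail = ordered[:tail_size]
    return sum(tail) / tail_size


def cvar_fitness(profits, emissions, lead_times, alpha=0.2):
    """Fitness vector built from per-objective CVaR instead of per-objective means."""
    return (
        cvar(profits, alpha, higher_is_better=True),
        cvar(emissions, alpha, higher_is_better=False),
        cvar(lead_times, alpha, higher_is_better=False),
    )


# Hypothetical episode returns for one policy across ten sampled episodes.
profits = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
emissions = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
lead_times = [2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 8.0]

print(cvar_fitness(profits, emissions, lead_times))
```

With ten samples and a 20% tail, each CVaR is the average of the two worst episodes. The fitness vector still has three entries, one per objective, which matches the point that CVaR changes how the KPIs are scored instead of adding a new one.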

The boundary is just as important. This is not a live risk-control system. It is a simulation-based robustness test. It shows that the evolutionary framework can absorb a tail-risk scoring rule and produce policies with better worst-case behaviour under the modelled uncertainty. That is valuable. It is not a guarantee against supplier bankruptcy, port shutdowns, data drift, or the ancient enterprise tradition of updating master data once per geological era.

The benchmark result is promising, but read the bar chart like an adult

The paper benchmarks MORSE against CAPQL and MONES, two multi-objective reinforcement learning methods selected for relevance to continuous action spaces and multi-objective optimisation. The benchmark figure reports stronger performance for MORSE, labelled there as the proposed MOEA-RL approach, in the inventory case.

The strongest visual separation appears on profit and emissions: MORSE achieves higher cumulative profit and lower total emissions than the alternatives in the reported comparison. The lead-time panel is closer, especially relative to MONES, while CAPQL is clearly worse on average lead time. The paper describes the method as outperforming both baselines across objectives; a careful business reader should interpret the result as strong evidence in this simulated inventory setting, not a law of nature.

That matters because benchmarking in RL is notoriously sensitive to environment design, reward scaling, hyperparameters, and simulator fidelity. MORSE’s advantage is plausible: evolutionary search naturally maintains diverse candidate policies, which is helpful in non-convex, multi-objective settings. But “plausible and demonstrated here” is not the same as “procurement should rewrite its entire optimisation stack by Friday.”

What the paper directly shows, and what Cognaptus would infer

The paper directly shows that MORSE can generate a Pareto front of neural inventory-control policies in simulated multi-echelon supply chains. It shows that switching among those policies can help respond to stylised emission penalties and cost shocks. It shows that CVaR can be incorporated into the evolutionary evaluation loop to produce more tail-aware policies. It also reports favourable benchmark performance against CAPQL and MONES in the chosen inventory-management case.

The business inference is narrower and more useful: MORSE suggests a way to treat supply-chain decision logic as a managed policy portfolio.

That portfolio could support operations in three practical ways.

First, it could reduce the need to recompute a full optimisation plan whenever priorities shift. A logistics team could maintain a front of candidate behaviours: margin-protecting, emissions-conservative, lead-time-aggressive, and tail-risk-averse.

Second, it could make trade-offs explicit. Instead of arguing over whether sustainability “matters,” the organisation can see what the low-emissions policy costs in lead time or profit under the simulator. This will not end executive debates, obviously. But it can at least make the debate less vibes-based.

Third, it could support shock rehearsal. Firms already run financial stress tests; supply-chain teams could run operational stress books. What happens if transport costs rise 10%? What happens if emissions thresholds tighten? What if seasonal demand peaks earlier? The Pareto catalogue becomes a library of possible responses.

A pilot should test switching governance before algorithmic elegance

A serious pilot should not begin with “deploy MORSE into production and see if procurement screams.” Start in shadow mode.

The practical pilot design would look like this:

Pilot layer	What to test	Why it matters
Simulator fidelity	Whether demand, lead time, costs, capacities, distances, and emission factors match operational reality	A policy trained on a fantasy simulator is just a confident hallucination with a warehouse badge
Reward design	Whether profit, emissions, and lead-time rewards reflect actual P&L and service constraints	Bad reward design creates elegant policies for the wrong business
Policy catalogue quality	Whether the Pareto front contains genuinely different behaviours	A “portfolio” of near-identical policies is theatre
Switching rule	When the system recommends moving from one policy to another	The business value depends on controlled switching, not merely having alternatives
Tail-risk validation	Whether CVaR-trained policies reduce bad outcomes on held-out disruptions	Tail-risk claims need stress testing outside the training distribution
Human override	How planners approve, reject, or modify recommended switches	Operators know constraints the model may not see, especially the undocumented ones, which are of course the most important

The key is to measure recommendation quality before handing over control. Run MORSE beside existing planning processes. Compare recommended actions, expected KPI impacts, and realised outcomes. Track not only average performance but also regret: where did the policy catalogue recommend a switch that looked good in simulation and bad in reality?
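A shadow-mode log for that kind of regret tracking could be as small as the sketch below. The record fields, the notion of "gain" against the incumbent plan, and the sample entries are all hypothetical; a real pilot would need to decide how realised gain is estimated for a recommendation that was never executed.

```python
from dataclasses import dataclass


@dataclass
class ShadowRecord:
    week: int
    recommended: str
    simulated_gain: float  # predicted KPI improvement over the incumbent plan
    realised_gain: float   # estimated improvement once actual outcomes are known


def regret_cases(records):
    """Recommendations that looked good in simulation and bad in reality."""
    return [r for r in records if r.simulated_gain > 0 and r.realised_gain < 0]


def mean_sim_to_real_gap(records):
    """Average amount by which the simulator overstated the gain."""
    return sum(r.simulated_gain - r.realised_gain for r in records) / len(records)


shadow_log = [
    ShadowRecord(week=1, recommended="low_carbon", simulated_gain=4.0, realised_gain=3.0),
    ShadowRecord(week=2, recommended="lead_time_aggressive", simulated_gain=2.0, realised_gain=-1.0),
    ShadowRecord(week=3, recommended="margin_first", simulated_gain=1.0, realised_gain=1.0),
]

for record in regret_cases(shadow_log):
    print(record.week, record.recommended)
print(mean_sim_to_real_gap(shadow_log))
```

A persistent positive gap points at simulator fidelity. Isolated regret cases point at constraints the simulator never knew about, which is usually where the planners' undocumented knowledge lives.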

The hard boundary is not the algorithm; it is the operating system around it

MORSE is strongest as a research prototype for dynamic multi-objective control. Its boundaries are clear.

The evidence is simulation-based, not a live deployment. The environments use stylised stochastic demand and lead-time assumptions, including Poisson and seasonal Poisson demand. Real supply chains include contract constraints, batch sizes, minimum order quantities, labour availability, supplier negotiation, customs delays, and the small matter of people ignoring systems when incentives disagree.

The switching examples assume policies can be changed at the disruption point. In real operations, switching has friction. Transport contracts may not allow instant mode changes. Warehouses may have labour schedules. Suppliers may have cut-off times. A low-emissions policy that requires unavailable capacity is not a policy; it is a wish.

The CVaR extension improves tail awareness under sampled uncertainty, but it depends on the tails being represented in the simulator. If rare events are missing from the training distribution, CVaR will optimise the wrong tail with admirable precision.

Finally, MORSE currently frames the supply chain as a single-agent control problem. The authors themselves identify multi-agent extensions as future work. That matters because supply chains are full of semi-independent actors with their own objectives. A supplier does not become obedient just because a neural policy found a Pareto-efficient replenishment decision. Rude, but true.

The useful lesson: stop asking for one best supply-chain policy

The most practical idea in MORSE is not that evolutionary reinforcement learning is clever. It is that supply-chain control should not be forced into one brittle compromise.

A changing operation needs a repertoire. MORSE offers a way to train that repertoire as a Pareto front of policies, then use switching to respond when emissions penalties, cost shocks, or demand patterns change. CVaR adds a second layer: not only “which policy performs well on average?” but “which policy behaves less badly when the tail shows up with a clipboard?”

That is the business value: not full autonomy, not optimisation magic, and certainly not the end of planners. The value is a more explicit operating surface for trade-offs that already exist.

The future supply-chain stack will not be one optimiser sitting on a throne. It will be a governed catalogue of policies, stress-tested against scenarios, monitored for drift, and selected according to current priorities. Less oracle, more cockpit.

For messy supply chains, that is probably the right level of ambition.

Cognaptus: Automate the Present, Incubate the Future.

¹ Niki Kotecha and Ehecatl Antonio del Rio Chanona, “MORSE: Multi-Objective Reinforcement Learning via Strategy Evolution for Supply Chain Optimization,” arXiv:2509.06490, 2025, https://arxiv.org/abs/2509.06490.
