Liquidity is boring until it vanishes.
Most investors notice market makers only when the screen suddenly looks thin: fewer bids, wider spreads, worse execution, and the faint smell of panic priced into every click. A market maker’s job is not glamorous. It quotes buy and sell prices, earns the spread, manages inventory, and tries not to become the proud owner of too much of the wrong asset at the wrong moment. Finance, as usual, rewards the person who stands calmly in the middle of everyone else’s urgency.
Óscar Fernández Vicente’s dissertation, Market Making Strategies with Reinforcement Learning, asks whether reinforcement learning can make that middleman more adaptive.[1] Not merely faster. Not merely more mathematical. Adaptive.
That distinction matters. A trading algorithm can be fast and still brittle. It can be profitable in one market regime and decorative in the next. It can win a backtest, lose live capital, and then explain nothing because the explanation lives somewhere inside a neural network’s smug little fog machine.
The dissertation’s real contribution is not “AI trader makes money.” That would be too easy, and suspiciously LinkedIn-compatible. The useful reading is a staged engineering story. First, train a deep Q-learning market maker to provide liquidity profitably in simulation. Then discover that profitability without inventory discipline is only half a system. Then replace the hand-crafted reward compromise with a multi-objective framework. Finally, stop pretending that one fixed policy can survive changing markets and introduce a policy-weighting method for non-stationary contexts.
It is a roadmap, not a production trading desk.
The paper builds a staircase, not a magic trader
The dissertation progresses through four technical stages, each solving a problem created by the previous stage.
| Stage | Technical move | Main purpose | Business reading |
|---|---|---|---|
| 1 | Deep Q-learning market maker | Learn profitable simulated quoting and hedging actions | RL can discover useful liquidity-provision behaviour under controlled market dynamics |
| 2 | AIIF and DITF reward engineering | Add dynamic inventory control | Profit must be balanced against balance-sheet exposure |
| 3 | Multi-objective RL with Pareto fronts | Separate profitability and inventory objectives | Risk appetite becomes a selectable operating policy, not a buried reward hack |
| 4 | POW-dTS policy weighting | Adapt among pre-trained policies under context changes | Non-stationary markets need policy orchestration, not one heroic model |
This timeline-first structure is the right way to read the work because each phase is an answer to a failure mode. The first model learns to make money, but inventory is weakly controlled. Reward engineering adds control, but requires prior assumptions about the desired utility function. Multi-objective RL separates the trade-off, but still needs policies suitable for changing environments. POW-dTS then treats policies as a library to be weighted dynamically.
The story is not “the agent becomes intelligent.” The story is “the system gradually becomes less naive.”
Progress, sadly, is often just disciplined disappointment with the previous version.
Stage one: the agent learns to quote before it learns to worry
The first experimental stage maps market making into a reinforcement-learning problem. The agent observes market and internal variables such as recent buys and sells, inventory, mid-price variation, current and previous spread, and traded volume. It chooses from a discrete action space: buy spread, sell spread, and how much inventory to hedge. The model uses a deep Q-network with three hidden layers of 32 neurons and 605 output actions.
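To make the discrete action space concrete, here is a minimal sketch of how a flat Q-network output index could map back to a quoting decision. The 11 × 11 × 5 factorisation of the 605 actions is an assumption, chosen only because it multiplies out correctly; the dissertation's actual grid, the meaning of each level, and the names `Quote`, `decode_action`, `encode_action`, and `greedy_action` are illustrative.

```python
from typing import NamedTuple

# Assumed grid sizes: 11 * 11 * 5 = 605. The source only states the total.
N_BUY_LEVELS = 11
N_SELL_LEVELS = 11
N_HEDGE_LEVELS = 5
N_ACTIONS = N_BUY_LEVELS * N_SELL_LEVELS * N_HEDGE_LEVELS


class Quote(NamedTuple):
    buy_level: int    # index into a hypothetical grid of buy-side spreads
    sell_level: int   # index into a hypothetical grid of sell-side spreads
    hedge_level: int  # index into a hypothetical grid of hedge fractions


def decode_action(index: int) -> Quote:
    """Map a flat action index (0..604) to its three components."""
    if not 0 <= index < N_ACTIONS:
        raise ValueError(f"action index out of range: {index}")
    buy_level, rest = divmod(index, N_SELL_LEVELS * N_HEDGE_LEVELS)
    sell_level, hedge_level = divmod(rest, N_HEDGE_LEVELS)
    return Quote(buy_level, sell_level, hedge_level)


def encode_action(quote: Quote) -> int:
    """Inverse of decode_action."""
    row = quote.buy_level * N_SELL_LEVELS + quote.sell_level
    return row * N_HEDGE_LEVELS + quote.hedge_level


def greedy_action(q_values: list[float]) -> Quote:
    """Pick the action whose estimated Q-value is highest."""
    best = max(range(len(q_values)), key=q_values.__getitem__)
    return decode_action(best)
```

The point of the flat encoding is that a single output layer can rank every combination of buy spread, sell spread, and hedge size at once, at the cost of an output width that grows multiplicatively with each grid.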
The environment is ABIDES, an agent-based financial market simulator. The setup includes noise agents, value agents, momentum agents, an adaptive percentage-of-volume market maker, an exchange agent, and additional investor agents that choose the cheapest available market maker. This design is important because the model is not trading against a spreadsheet. It is trading inside a synthetic market where other agents create price movement, spread dynamics, and competitive pressure.
In the single-agent market-making experiment, the DQL market maker is compared against random and persistent market makers over 250 simulations. The result is directionally clear: the DQL-MM achieves a mean reward of 159, in thousands of USD, while the random and persistent agents average -23 and -37 respectively. That is the dissertation’s first useful evidence point: the RL agent can learn profitable quoting behaviour in this simulated setting.
The multi-agent version is less tidy, which makes it more useful. When three DQL market makers compete together, only one earns strong positive average returns: DQL-MM #1 reaches 129, while #2 and #3 average -45 and -5. The simple moral is that adding learning competitors makes the environment harder. The more interesting lesson is that a market-making policy is partly a response to other market makers. A strategy that works against simple baselines may become less effective when other adaptive agents arrive.
This is the first business-relevant warning. An RL trader is not a static asset. It is an ecological participant. Its performance depends on the other species in the market swamp.
The direct transfer experiment sharpens this point. A pre-trained single-agent policy produces high average returns when transferred into a competitive environment, but it also shows much higher volatility. The still-learning agent earns less but behaves more consistently. In business terms, the “best historical policy” is not automatically the best operating policy. A fixed winner can become fragile once the game changes.
That is where the dissertation’s second stage begins.
Stage two: inventory turns profit into a risk problem
A market maker can look profitable while quietly accumulating a dangerous inventory position. This is not a technical footnote. It is the job.
The dissertation first tackles inventory control through reward engineering. Instead of applying a fixed penalty to inventory, it introduces two dynamic terms:
- AIIF, the Alpha Inventory Impact Factor, which controls the strength of inventory-risk aversion.
- DITF, the Dynamic Inventory Threshold Factor, which adjusts the tolerated inventory level according to the agent’s cash-to-inventory value situation.
The mechanism is intuitive. A market maker with a healthier cash-to-inventory ratio can tolerate more inventory than one already leaning heavily into the asset. A static penalty treats those cases too similarly. The proposed reward function changes the penalty as the agent’s own financial state changes.
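The shape of that mechanism can be sketched in a few lines. This is not the dissertation's reward formula, which is not reproduced here; it is a hypothetical stand-in in which DITF scales a cash-based inventory tolerance and AIIF scales the penalty on anything above it. The function names and the linear form are assumptions.

```python
def tolerated_inventory(cash: float, ditf: float) -> float:
    """Hypothetical DITF rule: tolerate inventory worth up to ditf * cash."""
    return ditf * max(cash, 0.0)


def shaped_reward(
    pnl_change: float,
    cash: float,
    inventory_value: float,
    aiif: float,
    ditf: float,
) -> float:
    """PnL change minus an AIIF-scaled penalty on excess inventory.

    Assumed form only: the penalty is linear in the inventory value
    that exceeds the cash-dependent tolerance.
    """
    excess = max(0.0, abs(inventory_value) - tolerated_inventory(cash, ditf))
    return pnl_change - aiif * excess


# Same inventory and same PnL change, different cash positions.
well_funded = shaped_reward(10.0, cash=1000.0, inventory_value=100.0, aiif=0.5, ditf=0.2)
stretched = shaped_reward(10.0, cash=200.0, inventory_value=100.0, aiif=0.5, ditf=0.2)
```

In this toy version the well-funded agent keeps its full PnL change because its inventory sits inside the tolerance, while the stretched agent is penalised for the same position. A static penalty would have scored both identically.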
The experiments show the expected trade-off. Lower AIIF values allow more aggressive quoting and higher mark-to-market performance, but with wider inventory exposure. Higher AIIF values constrain inventory, increase hedging, reduce trade capture, and can damage profitability. This is not a bug. It is the core economics of market making refusing to be wished away.
The reward-engineered approach is also compared against other reward functions, including full inventory penalty, asymmetrically dampened PnL, and PnL-only formulations. The strongest practical result is not that one coefficient is universally best. It is that PnL-only reward design performs poorly in both mark-to-market and inventory management. A model trained only to like profit discovers the ancient financial technique of ignoring risk until risk becomes the entire meeting agenda.
The dynamic reward function performs well, but it introduces a new problem. The operator must still choose the coefficients. One pre-trained agent is needed per configuration. If the desired utility function changes later, more training is required.
Reward engineering improves the system. It also exposes the limits of hiding a multi-objective business problem inside one scalar reward.
Stage three: Pareto fronts make the trade-off visible
The dissertation then moves from reward engineering to multi-objective reinforcement learning. This is the most important conceptual shift in the paper.
Instead of forcing profitability and inventory control into one scalar reward, the M3ORL agent receives a reward vector. One objective represents mark-to-market performance. The other represents inventory control. The architecture uses separate neural-network blocks for each objective, allowing the agent to learn the structure of each goal independently. A weight parameter then determines how strongly the selected action favours profitability versus inventory discipline.
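A minimal sketch of that last step, assuming the two network blocks each emit one Q-value per action and that the weight combines them linearly. Linear scalarisation is an assumption here, as is the function name `select_action`; the dissertation may combine the objectives differently.

```python
def select_action(
    q_profit: list[float],
    q_inventory: list[float],
    weight: float,
) -> int:
    """Return the action maximising weight*q_profit + (1-weight)*q_inventory.

    weight = 1.0 follows the profitability head only;
    weight = 0.0 follows the inventory-control head only.
    """
    if len(q_profit) != len(q_inventory):
        raise ValueError("both heads must score the same set of actions")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    scores = [
        weight * p + (1.0 - weight) * i
        for p, i in zip(q_profit, q_inventory)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Three candidate actions scored by each head (made-up numbers).
q_profit = [5.0, 1.0, 3.0]
q_inventory = [-4.0, 2.0, 1.0]
```

With these made-up scores, a weight of 1.0 picks action 0, a weight of 0.0 picks action 1, and a weight of 0.5 picks action 2, the compromise neither head would have chosen alone. Nothing is retrained between those three calls, which is the practical appeal of keeping the objectives separate.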
That changes the business interface. Under reward engineering, the risk preference is buried inside a crafted function. Under the MORL framing, risk appetite becomes something closer to an operating dial.
The paper evaluates this using multi-objective metrics: hypervolume, sparsity, and the number of undominated solutions. MORL produces a stronger Pareto-front profile than the tested reward-engineered alternatives. The reported normalized hypervolume is higher for MORL than for RE-W and RE-AIIF, and MORL contributes more undominated solutions in the combined Pareto analysis. The exact numbers matter less than the direction: separating objectives gives the operator a richer frontier of usable policies.
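For readers who have not met these metrics, the sketch below computes them for a two-objective case where both objectives are maximised. The points are invented for illustration and are not results from the dissertation. The sparsity function uses a common squared-gap definition, which may differ from the exact variant used in the paper.

```python
Point = tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both objectives and not equal."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b


def pareto_front(points: list[Point]) -> list[Point]:
    """Return the undominated points, sorted by the first objective."""
    unique = set(points)
    return sorted(p for p in unique if not any(dominates(q, p) for q in unique))


def hypervolume_2d(points: list[Point], reference: Point) -> float:
    """Area dominated by the front and bounded below by the reference point."""
    front = [
        p for p in pareto_front(points)
        if p[0] > reference[0] and p[1] > reference[1]
    ]
    area = 0.0
    previous_y = reference[1]
    # Walk from the largest first objective; the second objective then rises.
    for x, y in sorted(front, reverse=True):
        area += (x - reference[0]) * (y - previous_y)
        previous_y = y
    return area


def sparsity(points: list[Point]) -> float:
    """Mean squared gap between neighbouring front values, summed per objective."""
    front = pareto_front(points)
    if len(front) < 2:
        return 0.0
    total = 0.0
    for j in range(2):
        values = sorted(p[j] for p in front)
        total += sum((b - a) ** 2 for a, b in zip(values, values[1:]))
    return total / (len(front) - 1)


# Invented (profitability, inventory control) scores for five policies.
policies = [(1, 5), (3, 4), (2, 2), (5, 1), (3, 4)]
```

Here `pareto_front(policies)` keeps three policies and drops `(2, 2)`, which `(3, 4)` beats on both axes. A larger hypervolume means the front covers more of the objective space; a smaller sparsity means the usable policies are spaced more evenly along it.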
This is where a business reader should slow down.
The value of MORL is not that it magically removes the profit-risk trade-off. It makes the trade-off inspectable. A trading firm can ask, “What do we give up in inventory control if we move toward a more aggressive mark-to-market policy?” That is a governance question, not only a modelling question.
It also improves diagnosis. If a model fails, the firm can ask whether the failure came from the profitability objective, the inventory-control objective, or the weight used to choose between them. This is much better than staring into a monolithic reward function and hoping the answer emerges out of professional courtesy.
The EMA test is a useful negative result
One of the dissertation’s more quietly valuable tests adds trend information to the MORL agent’s state space. The author introduces exponential moving average variables: long EMA, short EMA, and EMA slope. The likely purpose is robustness and sensitivity testing: does the model improve if it sees short-term trend structure explicitly?
The answer is no, at least for this strategy and environment.
Agents with EMA variables perform worse in training and testing, and extended training does not materially rescue the result. The paper suggests two plausible reasons: the state space becomes more complex, and the strategy is high-frequency in character, relying more on immediate spread and inventory conditions than on longer trend signals.
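The three trend variables are cheap to compute, which is part of why they are tempting. The sketch below shows one plausible construction. The span lengths, the smoothing rule `alpha = 2 / (span + 1)`, and the definition of slope as the last change in the short EMA are all assumptions, not the dissertation's settings.

```python
def ema(prices: list[float], span: int) -> list[float]:
    """Exponential moving average seeded with the first price."""
    if not prices:
        raise ValueError("prices must not be empty")
    if span < 1:
        raise ValueError("span must be at least 1")
    alpha = 2.0 / (span + 1)
    values = [prices[0]]
    for price in prices[1:]:
        values.append(alpha * price + (1.0 - alpha) * values[-1])
    return values


def ema_features(
    prices: list[float],
    short_span: int = 5,    # assumed value
    long_span: int = 20,    # assumed value
) -> dict[str, float]:
    """Return the three extra state variables for the latest time step."""
    short = ema(prices, short_span)
    long_ = ema(prices, long_span)
    slope = short[-1] - short[-2] if len(short) > 1 else 0.0
    return {"short_ema": short[-1], "long_ema": long_[-1], "ema_slope": slope}


# A steadily rising mid-price: the short EMA should sit above the long one.
rising = [float(p) for p in range(1, 31)]
features = ema_features(rising)
```

Each of these values becomes one more input the network has to learn to use or ignore, which is the complexity cost the paper points to.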
This is a useful result because it resists feature hoarding. In trading systems, every team eventually faces the temptation to add one more signal. Trend, sentiment, macro surprise, order-book image, lunar phase if the budget allows. More information feels like more intelligence. Sometimes it is just more ways to overfit.
The lesson is narrow but valuable: features should match the action horizon. A market maker making rapid spread and hedge decisions may not benefit from trend variables designed for slower directional strategies. The model does not need “more context.” It needs the right context. A revolutionary thought, apparently.
Stage four: non-stationarity turns policies into a portfolio
By the final stage, the dissertation has a capable multi-objective market-making agent. But a single trained policy remains vulnerable to regime change. Financial markets do not politely remain in the distribution that made the backtest look clever.
The proposed answer is POW-dTS: Policy Weighting through discounted Thompson Sampling. Instead of relying on one policy, the system starts with a library of pre-trained M3ORL policies. These policies are trained under different competitive contexts: zero, one, five, and seven M3ORL market-maker competitors. The POW-dTS algorithm periodically evaluates policy performance and assigns operating time blocks to policies according to discounted Thompson sampling weights.
This is a subtle but important design choice. The policies are not switched randomly at every time step. They operate in blocks. That respects the sequential nature of reinforcement learning: an action can matter because of what it enables next, not only because of its immediate reward.
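The block-wise idea can be sketched with a textbook discounted Thompson sampler over a policy library. This is a simplification under stated assumptions, not the POW-dTS algorithm itself: it picks one policy per block rather than assigning weighted time shares, it assumes block performance has already been normalised to a score in [0, 1], and the class name, the `score_block` callback, and the discount value are hypothetical.

```python
import random
from typing import Callable, Optional


class DiscountedThompsonSelector:
    """Beta-posterior policy selector whose old evidence decays over time."""

    def __init__(self, n_policies: int, discount: float = 0.9,
                 seed: Optional[int] = None) -> None:
        if not 0.0 < discount <= 1.0:
            raise ValueError("discount must lie in (0, 1]")
        self.discount = discount
        self.successes = [0.0] * n_policies
        self.failures = [0.0] * n_policies
        self._rng = random.Random(seed)

    def choose(self) -> int:
        """Sample one plausible score per policy and return the best index."""
        samples = [
            self._rng.betavariate(s + 1.0, f + 1.0)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, policy: int, block_score: float) -> None:
        """Decay all evidence, then credit the policy that just ran."""
        if not 0.0 <= block_score <= 1.0:
            raise ValueError("block_score must lie in [0, 1]")
        self.successes = [self.discount * s for s in self.successes]
        self.failures = [self.discount * f for f in self.failures]
        self.successes[policy] += block_score
        self.failures[policy] += 1.0 - block_score


def run_blocks(selector: DiscountedThompsonSelector,
               score_block: Callable[[int, int], float],
               n_blocks: int) -> list[int]:
    """Run one policy per block; score_block(policy, block) is a stand-in
    for actually trading that policy over the block and normalising the result."""
    history = []
    for block in range(n_blocks):
        policy = selector.choose()
        selector.update(policy, score_block(policy, block))
        history.append(policy)
    return history


def toy_regime(policy: int, block: int) -> float:
    """Made-up environment: policy 0 is best for 100 blocks, then policy 1."""
    best = 0 if block < 100 else 1
    return 1.0 if policy == best else 0.0


history = run_blocks(DiscountedThompsonSelector(2, discount=0.9, seed=42),
                     toy_regime, 200)
```

The discount is what makes the selector forget. Without it, evidence from the first regime would keep outvoting the second long after the market had moved on; with it, the selector drifts toward whichever policy has performed recently while still sampling the others occasionally.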
The non-stationarity experiments compare single-policy approaches, continual-learning variants, random policy-selection ablations, POW-dTS, and an ideal multi-policy benchmark. The ideal benchmark assumes perfect knowledge: a flawless context-change detector and perfect assignment of the best policy to each context. Naturally, this benchmark performs best. Reality also performs best when granted omniscience; finance has yet to productize this reliably.
POW-dTS is the strongest practical method below that ideal benchmark. Across its tested configurations, it outperforms the single-policy and continual-learning baselines in reward and mark-to-market terms. The best POW-dTS configuration reports a reward of 4747 and MtM of 1.21, compared with the optimal multi-policy benchmark’s reward of 5254 and MtM of 1.22. That is roughly 90% of the benchmark’s reward with an almost identical MtM. The gap is not eliminated, but it is narrowed substantially without assuming perfect change-point detection.
The ablation study matters here. Random agents operating in blocks outperform the baseline, suggesting that simply using multiple policies sequentially already helps. But POW-dTS improves on that by adjusting weights according to observed performance. So the gain is not mystical Bayesian seasoning. It comes from two components: policy diversity and performance-sensitive allocation.
The business translation is direct. In changing markets, the unit of control may not be a single model. It may be a managed portfolio of policies.
What the evidence actually supports
The dissertation contains multiple experiment types. They should not all be read as equally decisive.
| Evidence item | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Single-agent DQL vs random and persistent MMs | Main evidence | DQL can learn profitable simulated market-making behaviour | Live trading profitability |
| Multi-agent DQL competition | Main evidence / stress test | Adaptive competitors make the environment harder and policies less universally reliable | That one trained RL policy is robust across all competitive structures |
| Direct transfer learning | Exploratory transfer test | Pre-trained policies can remain useful but may become volatile | Safe deployment of fixed policies |
| AIIF / DITF reward tests | Main inventory-control evidence | Dynamic inventory penalties can manage the profit-risk trade-off | That coefficient choices generalize across assets or regimes |
| Reward-function benchmarks | Comparison with prior approaches | Inventory-aware rewards outperform PnL-only design in this setup | Universal superiority of one reward design |
| MORL Pareto-front evaluation | Main multi-objective evidence | Separate objectives produce stronger trade-off frontiers than tested scalar alternatives | That all practical risk preferences are solved by one weight |
| EMA feature experiment | Robustness / sensitivity test | Trend variables do not help this high-frequency-style setup | That trend information is useless for trading generally |
| POW-dTS experiments | Main non-stationarity evidence | Policy weighting adapts well across simulated context changes | Production readiness without simulator validation and live-market safeguards |
| Random block ablation | Ablation | Sequential policy blocks are themselves useful | That dTS is the only useful part of the design |
This distinction matters because the paper’s strongest claim is architectural, not operational. It shows how to build increasingly realistic learning systems for market making in simulation. It does not show that a firm can drop the final algorithm into live markets and expect money to arrive, fully audited and wearing a tiny compliance badge.
The business value is research infrastructure before production alpha
For trading firms, exchanges, liquidity providers, and fintech platforms, the dissertation is most useful as an operating blueprint for experimentation.
The immediate business value is not “deploy this trader.” It is:
- A simulator-first research workflow. The paper demonstrates how agent-based simulation can be used to train, compare, and stress market-making policies before live exposure.
- A risk-governed policy design. The progression from PnL-only learning to inventory-aware reward engineering to MORL shows why trading AI must separate commercial objectives from risk constraints.
- A policy-library mindset. POW-dTS suggests that adaptive trading infrastructure may benefit from libraries of specialised policies, rather than one universal model endlessly fine-tuned into confusion.
- A measurable governance layer. Pareto fronts, hypervolume, sparsity, undominated solutions, and context-specific policy performance give operators diagnostic tools. Not perfect tools. But better than “the model liked it.”
- A warning against naive continual learning. Some continual-learning approaches underperform or suffer catastrophic forgetting. In live trading, catastrophic forgetting is an especially charming phrase for “we forgot how not to lose money.”
The broader implication is that agentic trading systems should be managed like adaptive control systems, not like clever scripts. They need simulation governance, policy libraries, risk dials, regime monitoring, and fallback logic.
If that sounds less thrilling than an autonomous AI trader conquering markets, good. Thrill is not a risk-control framework.
The boundary is simulation, and the paper knows it
The dissertation is careful about its main boundary: all experiments are conducted in ABIDES. That is a strength for scientific control and a weakness for direct deployment.
Before applying this to a real asset, a firm would need a simulator that resembles the asset’s actual spreads, volatility, returns, liquidity, and microstructure behaviour. That simulator would also need continuous monitoring because markets evolve. A simulator that was realistic last quarter may become a museum exhibit with Python bindings.
The paper also notes the Sim2Real problem: transferring policies trained in simulation into the real world. This is not a formality. It is the central deployment obstacle. Real markets bring transaction fees, slippage, latency, operational errors, compliance constraints, capital limits, and strategic opponents who did not agree to behave like benchmark agents.
Training directly in the market is possible in principle, but expensive and risky. Unlike simulation, live learning happens in real time and pays real costs for bad exploration. Safe reinforcement learning becomes essential if online training is attempted.
The practical boundary, then, is clear:
- The paper directly shows strong results in controlled simulated markets.
- Cognaptus infers that the architecture is valuable for market-making research infrastructure and adaptive policy management.
- What remains uncertain is live-market profitability after transaction costs, slippage, simulator mismatch, regulatory constraints, and adversarial market dynamics.
That uncertainty does not weaken the paper. It prevents misuse. A hammer is more valuable when nobody insists it is also a violin.
Agentic trading is orchestration, not daydreaming
The phrase “agentic AI trader” invites a silly mental picture: a machine with market intuition, tactical creativity, and a tasteful Bloomberg terminal. The dissertation points toward something more useful and less theatrical.
The agentic part is not consciousness. It is orchestration. The system observes market state, selects actions, manages inventory, separates objectives, evaluates policy performance, and reallocates among policies when the environment changes. Its intelligence is distributed across design choices: state representation, reward structure, objective separation, simulation discipline, and policy weighting.
That is the real lesson for business. The next generation of AI trading systems may not be defined by a single model that “understands markets.” They may be defined by infrastructure that can train many specialised policies, expose their trade-offs, monitor their failure modes, and switch among them with measured scepticism.
Markets do not dream. They reprice.
If AI traders are going to survive inside them, they will need less fantasy and more machinery.
Cognaptus: Automate the Present, Incubate the Future.
[1] Óscar Fernández Vicente, Market Making Strategies with Reinforcement Learning, arXiv:2507.18680, https://arxiv.org/abs/2507.18680.