When Learning Goes Rogue: Fixing RL Biases in Economic Simulations

TL;DR for operators

Simulation is a dangerous place to confuse optimisation with truth. Chen and Zhang’s paper, From Individual Learning to Market Equilibrium, shows that a reinforcement learning agent can optimise very successfully and still fail to reproduce the economic equilibrium it was supposedly simulating.¹ That is the useful sting in the paper. The failure is not that the RL agent is too weak. The failure is that the environment quietly gives the agent the wrong economic role.

The paper studies a search-and-matching labour market where the theoretical competitive equilibrium has market tightness $\theta = 0.767$. A naive single-agent RL implementation converges to a much lower value, around $\theta \approx 0.1$. The agent is not merely noisy. It has learned a different game.

The authors identify two mechanisms behind the distortion. First, a structural bias: the RL firm is placed in a closed-loop environment where its own vacancy decision affects market tightness, so it behaves less like a tiny price-taker and more like a firm with market power. Second, a parametric bias: the RL reward uses the vacancy cost $c$ in a way that underweights the economic opportunity cost of posting vacancies. The proposed repair is Calibrated Mean-Field Reinforcement Learning: hold the aggregate market field fixed while the agent learns, update that field iteratively, and replace $c$ with

$$ c_{\text{eff}} = \frac{r+\lambda}{\lambda}c. $$

For operators building AI economists, pricing simulators, policy sandboxes, workforce planning tools, or market-agent platforms, the message is blunt: do not only ask whether the agent learned. Ask whether the agent learned under the same institutional assumptions as the model you claim to simulate. A beautifully converged RL run can still be an expensive way to violate the premise.

The agent did not fail; the modelling contract failed

A familiar business version of this problem looks harmless. A team wants to simulate firms, consumers, traders, drivers, couriers, or workers. They define a reward function, drop an RL agent into the environment, run training, and inspect the resulting behaviour. The reward stabilises. The curves look respectable. Someone says “the agent has learned the market”.

Possibly. Or the agent has learned to exploit the simulator.

That distinction matters because many economic models depend on role assumptions. A firm in a competitive equilibrium is not supposed to decide the market wage. A trader in a price-taking model is not supposed to move the whole price process. A household in a macro model is not supposed to rewrite the aggregate law of motion just because it found a convenient policy. Economic theory often says: the individual treats the aggregate as given, while the aggregate is determined by everybody together.

Standard single-agent RL does not naturally preserve that separation. The agent acts, the environment changes, and the agent learns the consequences. That closed loop is usually the point of RL. In this paper, it is also the bug.

Chen and Zhang place this issue inside a search-and-matching labour market. Firms post vacancies. Workers and vacancies meet through a matching function. Labour market tightness is

$$ \theta = \frac{V}{U}, $$

where $V$ is aggregate vacancies and $U$ is unemployment. In the theoretical model, an individual firm treats $\theta$ as given. In the naive RL translation, however, the single firm’s own vacancy choice feeds directly into $\theta$. The agent is therefore no longer solving the competitive firm’s problem. It is solving a simulator problem where its action can move the aggregate variable that should have been outside its control.

This is why the paper is better read mechanism-first than method-first. “Mean-field RL for economic simulation” sounds like a technique. The real contribution is diagnostic: the authors show two specific ways a technically plausible RL simulation can become economically wrong.

The benchmark says $\theta = 0.767$; naive RL learns something else

The paper’s benchmark model is deliberately standard. Production is concave, matching has constant returns to scale, and firms choose vacancies dynamically. The authors use the following functional forms: production $f(l)=Al^\alpha$, matching probability $q(\theta)=a\theta^{-\phi}$, and a wage equation that depends on labour, bargaining power, unemployment benefits, vacancy cost, and market tightness.

With the paper’s default parameters, the theoretical steady state is:

| Variable | Meaning | Theoretical steady state |
| --- | --- | --- |
| $l$ | employment | 0.967 |
| $u$ | unemployment | 0.033 |
| $q$ | vacancy-filling probability | 0.552 |
| $w$ | wage | 0.831 |
| $v$ | vacancies | 0.025 |
| $\theta$ | market tightness | 0.767 |

That table is the control case. It gives the RL simulation something precise to reproduce.

The naive RL setup then gives the agent an observed state consisting of employment and unemployment, and a continuous action representing vacancy posting. The reward is current output minus wages and vacancy costs:

$$ r_t = f(l_t) - w(l_t,\theta_t)l_t - cv_t. $$
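A minimal sketch of that reward in code may help fix ideas. It uses the production form $f(l)=Al^\alpha$ stated earlier; the wage rule is passed in as a callable because the text does not spell out the paper's wage equation. `toy_wage` and every numeric value here are hypothetical placeholders, not the paper's calibration.

```python
from typing import Callable


def production(l: float, A: float, alpha: float) -> float:
    """Concave production f(l) = A * l**alpha, as given in the text."""
    return A * l ** alpha


def naive_reward(
    l: float,
    theta: float,
    v: float,
    c: float,
    A: float,
    alpha: float,
    wage: Callable[[float, float], float],
) -> float:
    """Per-period reward r_t = f(l_t) - w(l_t, theta_t) * l_t - c * v_t."""
    return production(l, A, alpha) - wage(l, theta) * l - c * v


def toy_wage(l: float, theta: float) -> float:
    """Hypothetical wage rule, rising in tightness. NOT the paper's equation."""
    return 0.5 + 0.1 * theta


# Illustrative call with made-up numbers.
r_t = naive_reward(l=0.9, theta=0.5, v=0.02, c=0.3, A=1.0, alpha=0.5, wage=toy_wage)
```

Keeping the wage rule as an argument makes the later point easy to see: what matters is not the formula for the reward but whether `theta` is something the agent's own `v` can move.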

The agent is trained using DDPG, with actor and critic networks, experience replay, 50 independently simulated episodes per mean-field training iteration, and episodes of length 200. These details are implementation choices, not the paper’s thesis; their role is to make the comparison concrete rather than purely philosophical.

The main evidence comes from Figure 1. The reward stabilises, suggesting the agent has learned a policy, but market tightness fluctuates around $0.1$, far below the theoretical benchmark of $0.767$. That gap is the paper’s “something is wrong here” moment. A more than sevenfold difference in the central aggregate variable is not a rounding error wearing a lab coat.

The important interpretation is this: the RL agent’s convergence does not validate the economic simulation. It only validates that the agent found a good policy inside the environment it was given. If that environment encodes the wrong economic role, convergence merely makes the error more confidently repeatable.
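One way to keep the two questions apart is to report them separately. The sketch below is an illustration only: the function names, the 50-step window, and the tolerances are assumptions, not anything taken from the paper.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def reward_has_stabilised(rewards: Sequence[float], window: int = 50, tol: float = 0.01) -> bool:
    """True if the last `window` rewards vary little relative to their mean."""
    tail = list(rewards[-window:])
    if len(tail) < window:
        return False
    scale = max(abs(mean(tail)), 1e-12)  # guard against division by zero
    return pstdev(tail) / scale < tol


def matches_benchmark(simulated: float, benchmark: float, rel_tol: float = 0.05) -> bool:
    """True if the simulated aggregate is within rel_tol of the theoretical value."""
    return abs(simulated - benchmark) <= rel_tol * abs(benchmark)


def validate_run(
    rewards: Sequence[float],
    theta_path: Sequence[float],
    theta_benchmark: float,
    window: int = 50,
) -> Dict[str, bool]:
    """Report convergence and correctness as two separate flags."""
    return {
        "converged": reward_has_stabilised(rewards, window),
        "correct": matches_benchmark(mean(theta_path[-window:]), theta_benchmark),
    }


# Stylised stand-in for the naive run: flat reward, tightness stuck near 0.1.
report = validate_run([1.0] * 100, [0.1] * 100, theta_benchmark=0.767)
```

For this stylised input the first flag is true and the second is false, which is the pattern the naive run exhibits: a converged policy for the wrong game.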

Structural bias: the firm becomes a market manipulator

The first failure mechanism is structural. The theoretical firm is a tightness-taker. The naive RL firm is not.

In the economic model, the firm’s job-creation condition assumes that $\theta$ is exogenous to the individual firm. The firm asks: given market tightness, wages, separation risk, and vacancy cost, how many vacancies should I post?

In the naive RL environment, the agent observes that its own vacancy choice affects $\theta_t = v_t/u_t$. Since wages depend partly on $\theta$, the agent has an incentive to manipulate tightness. Lower tightness can suppress the wage term in the reward. The paper’s derivation makes this explicit: once $\partial \theta / \partial v$ is allowed to be non-zero inside the agent’s optimisation, an extra strategic term appears in the first-order condition. That term is not part of the competitive equilibrium condition.
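A one-period sketch shows where the extra term comes from. This is a static simplification of the reward given earlier, not the paper's full dynamic first-order condition, and it assumes $l_t$ and $u_t$ are held fixed within the period:

$$ \left.\frac{\partial r_t}{\partial v_t}\right|_{\theta\ \text{given}} = -c, \qquad \left.\frac{\partial r_t}{\partial v_t}\right|_{\theta_t = v_t/u_t} = -c - \frac{l_t}{u_t}\,\frac{\partial w}{\partial \theta}. $$

If wages rise with tightness, the second expression is more negative than the first, so the agent perceives vacancies as costlier than a tightness-taking firm would and posts fewer of them. At the benchmark values $l = 0.967$ and $u = 0.033$, the multiplier $l_t/u_t$ is roughly 29, so even a modest wage sensitivity is heavily amplified.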

In plain English: the simulator accidentally promoted the firm from a small participant to a market actor with aggregate influence. Naturally, the agent noticed. RL agents are rude like that.

This matters beyond labour markets. The same structural error can appear whenever a model assumes individual agents treat an aggregate variable as given, but the RL environment lets one representative agent move it directly.

Examples:

| Domain | Aggregate that may need to be treated as exogenous by the individual | What goes wrong if the agent controls it |
| --- | --- | --- |
| Labour simulation | market tightness, wage level | the firm learns wage manipulation rather than competitive hiring |
| Pricing simulation | market price, competitor response index | the agent learns artificial market power |
| Credit risk simulation | default environment, macro state | the lender optimises against a macro process it should not individually control |
| Supply chain simulation | congestion, clearing price, capacity utilisation | the firm exploits simulator feedback rather than market clearing |
| Policy sandbox | aggregate compliance, tax base, employment | the representative agent may internalise system-level response incorrectly |

The error is subtle because the simulation may still look sensible locally. The agent posts vacancies. Employment changes. Wages respond. Rewards are computed. Nothing crashes. The mistake lives in the modelling contract: who is allowed to affect what?
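That question can be turned into a mechanical probe: perturb the agent's action and see whether the aggregate moves. The sketch below is an assumption-laden illustration; the function names are hypothetical, and the values of $u$, $v$, and $\theta$ are the steady-state figures quoted earlier.

```python
from typing import Callable

TightnessFn = Callable[[float, float], float]


def naive_tightness(v: float, u: float) -> float:
    """Closed loop: the agent's own vacancies set theta = v / u."""
    return v / u


def make_fixed_field(theta_k: float) -> TightnessFn:
    """Mean-field style: theta is frozen at theta_k whatever the agent does."""
    def tightness(v: float, u: float) -> float:
        return theta_k
    return tightness


def action_sensitivity(tightness_fn: TightnessFn, v: float, u: float, eps: float = 1e-6) -> float:
    """Central finite difference of the aggregate with respect to the agent's action."""
    return (tightness_fn(v + eps, u) - tightness_fn(v - eps, u)) / (2 * eps)


u = 0.033  # steady-state unemployment from the benchmark table
naive_slope = action_sensitivity(naive_tightness, v=0.025, u=u)          # close to 1 / u
fixed_slope = action_sensitivity(make_fixed_field(0.767), v=0.025, u=u)  # exactly zero
```

A non-zero slope for a variable the theory treats as given is the signature of the structural bias. The same probe applies to any of the aggregates in the table above.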

Parametric bias: the reward prices vacancies too cheaply

The second mechanism is parametric. It is less theatrical than the manipulator effect, but just as damaging.

The economic model’s vacancy cost is not merely a per-period penalty sitting in a reward function. It is part of an intertemporal investment problem. When a firm spends resources posting a vacancy, it gives up the opportunity to invest those resources elsewhere. The job also has an expected lifetime governed by the separation rate $\lambda$. The interest rate $r$ and separation rate together shape the effective cost of job creation.

In the economic job-creation condition, the relevant cost term is proportional to

$$ \frac{(r+\lambda)c}{q(\theta)}. $$

In the naive RL simulation, the cost is treated more like a directly discounted flow cost. The paper argues that this effectively turns the corresponding cost logic into

$$ \frac{\lambda c}{q(\theta)}. $$

The missing part is the opportunity cost of capital. The RL agent sees that jobs end through $\lambda$, because separation is in the transition dynamics. But it does not properly account for $r$, the return forgone by tying capital into vacancy creation. So the agent can evaluate vacancy posting with the wrong economic semantics.

The correction is simple and revealing:

$$ c_{\text{eff}} := \left(1+\frac{r}{\lambda}\right)c = \frac{r+\lambda}{\lambda}c. $$

This is not just a parameter tweak. It is a translation layer between economic theory and RL reward design. The point is not that every RL simulator needs this exact formula. The point is that economic parameters often carry institutional meaning. If a reward function copies the symbol but not the meaning, the agent solves the wrong problem with admirable focus.
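A few lines of arithmetic confirm what the calibration does: substituting $c_{\text{eff}}$ into the naive cost term recovers the economic one. The parameter values below are hypothetical placeholders chosen for readability, not the paper's defaults.

```python
def effective_vacancy_cost(c: float, r: float, lam: float) -> float:
    """c_eff = ((r + lam) / lam) * c, the calibrated vacancy cost from the text."""
    if lam <= 0:
        raise ValueError("separation rate lam must be positive")
    return (r + lam) / lam * c


# Hypothetical values for illustration only.
c, r, lam, q = 0.3, 0.01, 0.05, 0.5

c_eff = effective_vacancy_cost(c, r, lam)

# Naive cost logic evaluated at c_eff versus the economic cost term at c.
naive_term_with_c_eff = lam * c_eff / q
economic_term = (r + lam) * c / q
```

The two terms agree up to floating-point error, and setting `r = 0` returns `c` unchanged: the correction matters exactly to the extent that capital has an opportunity cost.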

For business simulations, this is a common failure mode. A cost in a spreadsheet, a cost in an economic model, and a cost in an RL reward may share a label while representing different timing assumptions. That is how teams end up with simulations that are numerically disciplined and conceptually sloppy. The spreadsheet said “cost”. The model said “intertemporal opportunity cost”. The RL environment heard “small negative number per step”. Charming.

The repair: freeze the field, train the agent, update the field

The paper’s proposed method, Calibrated Mean-Field RL, repairs both biases together.

The structural correction uses a mean-field formulation. During agent learning, the aggregate market field — here, market tightness $\theta^{(k)}$ — is fixed. The agent then solves its decision problem as if it were atomistic. After the agent’s policy is learned, the system simulates population-level behaviour under that policy and updates the aggregate field. This repeats until the field is self-consistent.

The parametric correction replaces the raw vacancy cost $c$ with $c_{\text{eff}}$ in the reward.

The algorithmic structure is therefore:

Start with a guess for aggregate market tightness θ(0)
Compute c_eff = ((r + λ) / λ)c

Repeat:
  Train the RL agent while holding θ(k) fixed
  Obtain the agent policy π(k)
  Simulate the population under π(k)
  Update θ(k+1)
Until θ(k+1) is close to θ(k)
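The outer loop above can be sketched in Python as follows. This is a skeleton under stated assumptions, not the authors' implementation: `train_policy` and `simulate_tightness` are hypothetical callables standing in for the DDPG training run and the population simulation, the `damping` parameter is an addition not described in the text, and the toy functions at the bottom are an arbitrary contraction rather than a labour-market model.

```python
from typing import Any, Callable, Tuple


def calibrated_mean_field_loop(
    train_policy: Callable[[float, float], Any],
    simulate_tightness: Callable[[Any], float],
    theta0: float,
    c: float,
    r: float,
    lam: float,
    tol: float = 1e-6,
    max_iter: int = 100,
    damping: float = 1.0,  # assumption: 1.0 means a plain, undamped update
) -> Tuple[float, Any, int]:
    """Iterate theta until the aggregate field is self-consistent."""
    c_eff = (r + lam) / lam * c  # calibrated vacancy cost
    theta = theta0
    for k in range(1, max_iter + 1):
        policy = train_policy(theta, c_eff)     # theta is held fixed while learning
        theta_sim = simulate_tightness(policy)  # population-level outcome under the policy
        theta_next = (1 - damping) * theta + damping * theta_sim
        if abs(theta_next - theta) < tol:
            return theta_next, policy, k
        theta = theta_next
    raise RuntimeError("mean-field iteration did not converge")


def toy_train(theta: float, c_eff: float) -> float:
    """Toy stand-in for RL training; ignores c_eff and returns a number as the 'policy'."""
    return 0.3 + 0.5 * theta


def toy_simulate(policy: float) -> float:
    """Toy stand-in for the population simulation; the policy is the new tightness."""
    return policy


theta_star, _, n_iter = calibrated_mean_field_loop(
    toy_train, toy_simulate, theta0=0.1, c=0.3, r=0.01, lam=0.05
)
```

The toy map has its fixed point at 0.6, and the loop reaches it because the map shrinks distances by half at every step. In a real implementation each call to `train_policy` is a full RL training run, so the number of outer iterations is the expensive quantity.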

This is the paper’s main methodological contribution, but its value is easiest to understand as a discipline of separation. Inside each learning step, the agent is not allowed to treat the market as something it personally controls. Across iterations, the market is still endogenous. The aggregate changes, but only through the population-level update.

That distinction is the difference between “the firm reacts to the market” and “the firm is the market”. The latter is convenient for coding. The former is closer to the economic theory.

What each experiment is actually doing

The paper’s evidence is compact, so it is worth separating main evidence from supporting checks. Otherwise everything becomes “results”, which is a fine way to leave readers with more information and less clarity.

| Paper component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Theoretical steady-state solution | Benchmark | Establishes the economic target: $\theta = 0.767$ with known steady-state values | Does not show that RL can reach it |
| Naive RL result in Figure 1 | Main evidence of failure | Shows RL convergence can produce $\theta \approx 0.1$, far from equilibrium | Does not isolate which bias causes how much distortion |
| Structural-bias derivation | Mechanism diagnosis | Shows that making $\theta$ depend on the agent’s own vacancy choice changes the first-order condition | Does not quantify the full empirical contribution of the term alone |
| Cost-calibration derivation | Mechanism diagnosis | Shows why $c$ in RL reward underprices vacancy creation relative to the economic opportunity-cost condition | Does not imply this exact adjustment applies to every economic model |
| Calibrated MF-RL algorithm | Proposed repair | Combines fixed-field learning with calibrated reward semantics | Does not guarantee universal convergence across arbitrary environments |
| Fully corrected result in Figure 2 | Main evidence of repair | Shows corrected simulation converges around the theoretical $\theta = 0.767$ | Evidence is still within one stylised labour-market setting |
| Appendix C ablations | Ablation | Shows either correction alone fails: MFG-only overshoots tightness; cost-only remains too low | Does not provide a broad sensitivity grid across model classes |
| Appendix D convergence theorem | Theoretical support | Gives contraction-based conditions for fixed-point convergence | Depends on Lipschitz assumptions and a composite constant below one |

The ablation study is especially useful. With only the structural correction, the agent no longer manipulates tightness during training, but the vacancy cost remains too cheap. The result is over-optimistic hiring and market tightness above the benchmark. With only the parametric correction, the cost semantics are repaired, but the single-agent environment still lets the agent act like a market manipulator. The result remains well below equilibrium.

This is the cleanest empirical message in the paper: the two errors are not interchangeable. Fixing only one leaves a different wrong answer.
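The ablation design is a two-by-two grid, which is easy to enumerate explicitly. The sketch below only builds the four configurations and attaches the qualitative outcome reported in the text; the dictionary keys and labels are hypothetical naming choices, and nothing here runs an experiment.

```python
from itertools import product
from typing import Dict, List, Union

# Qualitative outcomes as described in the text, keyed by (fixed_field, calibrated_cost).
REPORTED_OUTCOME = {
    (False, False): "tightness far below benchmark (around 0.1)",
    (True, False): "tightness overshoots the benchmark",
    (False, True): "tightness remains well below the benchmark",
    (True, True): "tightness close to the benchmark (0.767)",
}

LABELS = {
    (False, False): "naive",
    (True, False): "mean-field only",
    (False, True): "cost only",
    (True, True): "calibrated mean-field",
}


def ablation_configs() -> List[Dict[str, Union[str, bool]]]:
    """Enumerate every combination of the two corrections."""
    configs = []
    for fixed_field, calibrated_cost in product([False, True], repeat=2):
        key = (fixed_field, calibrated_cost)
        configs.append({
            "name": LABELS[key],
            "fixed_field": fixed_field,
            "calibrated_cost": calibrated_cost,
            "reported_outcome": REPORTED_OUTCOME[key],
        })
    return configs


configs = ablation_configs()
```

Running all four cells, rather than only the naive and fully corrected ones, is what lets a team say which correction is doing which part of the work.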

The business value is model governance, not an RL upgrade

The obvious but shallow reading is that Calibrated Mean-Field RL is a better algorithm for economic simulation. True enough, within the paper’s setting. But the more operational reading is sharper: the paper gives a governance checklist for agentic simulations.

When a company builds a market simulator, it usually worries about data, model fit, compute, explainability, and perhaps regulatory review. This paper points to a more basic layer: does the agent’s optimisation problem preserve the economic assumptions of the model?

Three checks follow directly.

First, define the agent’s market role before training. Is the agent supposed to be a price-taker, a strategic actor, a monopolist, a representative household, or a planner? These are not cosmetic labels. They determine which variables the agent may treat as controllable.

Second, audit aggregate feedback loops. Any variable of the form “market price”, “aggregate demand”, “tightness”, “congestion”, “default climate”, “capacity utilisation”, or “industry wage” deserves suspicion. If one agent can directly move it inside training, the simulation may be assigning market power that the theory never granted.

Third, translate economic parameters, do not merely copy them. A cost parameter may include opportunity cost, duration, risk adjustment, capital charge, or institutional constraints. The reward function needs the same semantics, not merely the same notation.

For an AI economist or policy sandbox, this affects scenario credibility. A simulator used for labour planning, tax policy, credit stress, logistics pricing, or market design can produce confident recommendations from a mis-specified game. The risk is not random noise. It is systematic distortion.

What the paper directly shows, what Cognaptus infers, and what remains uncertain

The paper directly shows that, in a stylised search-and-matching labour model with concave production, a naive single-agent RL simulation fails to reproduce the theoretical competitive equilibrium. It identifies two specific causes: structural bias from giving the agent control over an aggregate variable it should treat as exogenous, and parametric bias from misaligning the RL reward with economic opportunity-cost logic. It then shows that a combined mean-field and cost-calibration correction brings the learned simulation close to the theoretical benchmark. The ablations support the claim that both corrections are necessary in this setup.

Cognaptus infers that similar checks should be applied to business-facing agent simulations wherever individual agents interact with aggregate variables. The paper is especially relevant to market simulators, synthetic policy environments, RL-based pricing systems, workforce and hiring models, and macro-financial sandboxes. The transferable lesson is not the exact labour-market equation. It is the discipline of preserving role assumptions and parameter semantics.

What remains uncertain is the breadth of generalisation. The experiment uses one economic model, one principal RL algorithm family, and a specific equilibrium target. The convergence argument depends on Lipschitz continuity and a contraction condition, which the authors acknowledge is difficult to prove in deep-learning systems and is supported in their implementation by empirical smoothness. That is respectable, but it is not a universal warranty. Anyone selling one should be watched carefully.
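For readers who want the shape of the condition, the standard contraction argument runs as follows. The decomposition of the constant is an illustrative assumption about how a “composite” constant could arise, not a restatement of the paper's theorem. Let $T$ denote the map that takes $\theta^{(k)}$ to $\theta^{(k+1)}$. If

$$ |T(\theta) - T(\theta')| \le \kappa\,|\theta - \theta'| \quad \text{with } \kappa < 1, $$

then $T$ has a unique fixed point $\theta^\ast$ and the iterates satisfy

$$ |\theta^{(k)} - \theta^\ast| \le \kappa^{k}\,|\theta^{(0)} - \theta^\ast|. $$

If $T$ is the composition of a field-to-policy map with Lipschitz constant $L_\pi$ and a policy-to-field map with Lipschitz constant $L_\Phi$, then $\kappa \le L_\pi L_\Phi$. The hard part in practice is the first factor: it asks that a trained neural policy respond smoothly to small changes in the frozen field.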

A practical checklist for teams building RL economic simulators

Before trusting the output of an RL-based market simulation, ask:

| Question | Why it matters | Failure sign |
| --- | --- | --- |
| What role is the agent supposed to play? | Competitive agents, strategic firms, and planners solve different problems | The simulator silently gives a price-taker market power |
| Which variables are individual controls, and which are aggregates? | Equilibrium models often separate individual optimisation from aggregate consistency | The agent can directly move a market-level variable |
| Does the reward preserve economic meaning? | Parameters may encode opportunity cost, discounting, hazard rates, or institutional constraints | The same symbol appears in code but with different timing logic |
| Is convergence being mistaken for correctness? | An agent can converge to the wrong game | Reward stabilises while benchmark variables drift far from theory |
| Are ablations testing mechanisms rather than decoration? | Mechanism-level tests reveal whether each correction is necessary | A single “improved result” is reported without isolating causes |
| What benchmark anchors the simulation? | Without a known reference point, behavioural plausibility can become storytelling | The model is judged by vibes, which remain undefeated in some meetings |

This checklist is not a substitute for formal validation. It is the minimum entry fee.

The limitation is narrowness, not irrelevance

The paper’s narrow setup is a limitation, but not a fatal one. In fact, the narrowness is partly what makes the result useful. Because the theoretical benchmark is explicit, the authors can show exactly where the RL simulation deviates. In messier real-world settings, there may be no clean equilibrium target, which makes this class of error harder to detect.

Still, practical users should not overgeneralise. The paper does not prove that Calibrated MF-RL is the right architecture for every economic simulator. It does not benchmark across many labour models, heterogeneous firms, alternative matching functions, multiple RL algorithms, or noisy empirical data. It does not solve all the hard problems of agent-based macroeconomic modelling. The appendix convergence result is conditional, not magical.

The right takeaway is therefore disciplined rather than grand: when using RL to simulate equilibrium behaviour, validate both the structural role of the agent and the economic meaning of the reward. If either is wrong, more training can make the answer cleaner and worse.

The agent learned exactly what it was allowed to learn

There is a tidy irony in this paper. The naive RL agent is not stupid. It is too capable for a badly specified economic role. It discovers that the simulator lets it influence market tightness, so it uses that channel. It evaluates vacancy costs under the reward semantics provided, so it optimises under those semantics. Then the human observer is surprised that the result does not match the economic benchmark.

The fault is not in learning. The fault is in translation.

That is why the paper matters for business AI. As firms move from prediction tools to agentic simulators, the hard problem is not only training agents that optimise. It is training agents inside environments where optimisation means what the business thinks it means. Otherwise the simulator becomes a theatre of competence: stable curves, confident policies, and a quiet violation of the model’s assumptions.

Calibrated Mean-Field RL is one repair for one class of economic simulation. The larger lesson is broader and less comfortable: before asking whether an AI agent has found the optimum, ask whether you built the right game.

Cognaptus: Automate the Present, Incubate the Future.

¹ Ruxin Chen and Zeqiang Zhang, “From Individual Learning to Market Equilibrium: Correcting Structural and Parametric Biases in RL Simulations of Economic Models,” arXiv:2507.18229, 2025.
