Opening — Why this matters now
Everyone wants AI agents that can “act.” Few can explain what that actually means in a market context. Generating text is trivial. Simulating decisions under constraints—price, inventory, demand elasticity—is where things start to look suspiciously like… economics.
The uncomfortable truth is this: most AI systems today can talk like consumers, but they don’t behave like them. They lack price sensitivity, memory of past purchases, and—perhaps most critically—any coherent response to incentives.
This paper introduces MALLES, a multi-agent LLM economic sandbox designed to close that gap. Not by adding more prompts, but by forcing models to live inside a simulated economy where decisions have structure, feedback, and consequences.
In other words: we’re watching LLMs graduate from chatbots to economic actors.
Background — Context and prior art
Economic simulation is not new. It has simply been… underwhelming.
Traditional approaches fall into three camps:
| Approach | Strength | Weakness |
|---|---|---|
| Rule-based / survey models | Interpretable | Rigid, poor generalization |
| Deep learning demand models | Predictive power | Requires heavy feature engineering |
| Agent-based simulations (ABM) | Rich interactions | Computationally expensive, brittle |
Recent LLM-based systems (e.g., EconAgent, LLM Economist) introduced semantic agents—entities that can reason, negotiate, and explain.
But they suffer from three structural issues:
- Data sparsity — Real transaction data is fragmented across categories
- OOD failure — Models collapse when encountering new products
- Numerical blindness — LLMs treat price like decoration, not signal
In short: they can simulate dialogue, but not demand curves.
Analysis — What MALLES actually does
MALLES takes a different stance: stop treating LLMs as storytellers, and start treating them as decision functions.
Formally, the system tries to approximate:
$$ \hat{a}_i = \hat{D}(X_i^{\mathrm{obs}}, Z_i, \hat{\rho}_i; \theta) $$
where the predicted action $\hat{a}_i$ is reconstructed from observable inputs $X_i^{obs}$, latent consumer profiles $Z_i$, and learned preferences $\hat{\rho}_i$, all parameterized by $\theta$.
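To make the abstraction concrete, here is a minimal sketch of what such a decision function looks like as an interface. The class names, fields, and the rule-based logic are all illustrative assumptions, not the paper's actual (learned) model:

```python
from dataclasses import dataclass

@dataclass
class Observation:
    price: float     # part of X_i^obs: observable product features
    discount: float

@dataclass
class Profile:
    budget: float             # Z_i: latent consumer profile
    price_sensitivity: float  # rho_i: learned preference parameter

def decision_fn(obs: Observation, profile: Profile) -> dict:
    """Toy stand-in for D-hat: map observables and a latent profile to a
    purchase action. The real mapping is learned; this hand-written rule
    only illustrates the input/output contract."""
    effective_price = obs.price * (1.0 - obs.discount)
    if effective_price > profile.budget:
        return {"buy": False, "quantity": 0}
    quantity = max(1, int(profile.budget / effective_price
                          * (1.0 - profile.price_sensitivity)))
    return {"buy": True, "quantity": quantity}
```

The point of the interface is that price and discount enter as numbers the function must respond to, not as decorative text in a prompt.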
1. Cross-category preference alignment
Instead of training on narrow product categories, MALLES aggregates transaction data across thousands of categories.
This creates a transfer effect:
| Training Strategy | Data Size | Generalization | Practical Use |
|---|---|---|---|
| Single-category | Small | High precision, low transfer | Mature products |
| Full-category | Large | Strong transfer | New products / sparse data |
The theoretical intuition is straightforward: more data reduces variance, but cross-category similarity adds a regularization effect.
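That regularization intuition can be sketched as partial pooling: sparse categories borrow strength from the cross-category average, while data-rich categories keep their own estimates. The function name, the elasticity framing, and the shrinkage weight are my illustrative assumptions, not MALLES's training procedure:

```python
import numpy as np

def shrink_to_pooled(category_estimates: dict, counts: dict,
                     tau: float = 10.0) -> dict:
    """Partial pooling: shrink each category's (e.g. elasticity) estimate
    toward the cross-category mean, with shrinkage inversely proportional
    to the category's sample size. Sparse categories lean on the pool;
    data-rich ones keep their own signal."""
    pooled = float(np.mean(list(category_estimates.values())))
    out = {}
    for cat, est in category_estimates.items():
        n = counts[cat]
        w = n / (n + tau)  # weight on the category's own estimate
        out[cat] = w * est + (1 - w) * pooled
    return out
```

With 90 observations a category keeps ~90% of its own estimate; with 10 it splits evenly with the pool, which is exactly the variance-reduction effect the table describes.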
2. Multi-agent discussion (yes, actual “thinking”)
Instead of one LLM pretending to be everyone, MALLES splits roles:
- Dealer (decision maker)
- Service agent (market interface)
- Manufacturer (constraints)
They engage in structured dialogue before producing a decision.
This does two things:
- Compresses high-dimensional inputs into actionable signals
- Avoids local optima (single-agent tunnel vision)
It’s less “AI thinking,” more organizational simulation.
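The role split above can be sketched as a single structured round. The function and the dict-based protocol are assumptions for illustration; in MALLES each callable would wrap an LLM:

```python
def structured_round(dealer, service, manufacturer, market_state: dict) -> dict:
    """One structured discussion round: the service agent compresses
    market signals, the manufacturer states supply-side constraints,
    and the dealer makes the final decision. Each agent is any
    callable (e.g. an LLM wrapper) from dict to dict."""
    summary = service(market_state)           # market interface
    constraints = manufacturer(market_state)  # production constraints
    return dealer({"summary": summary, "constraints": constraints})
```

The compression happens structurally: the dealer never sees the raw high-dimensional market state, only the two summaries, which is what blocks single-agent tunnel vision.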
3. Mean-field stabilization
Here’s where it gets quietly sophisticated.
Individual decisions are noisy. Markets are not.
MALLES introduces a mean-field variable $\mu_t$ to represent aggregate market behavior and feeds it back into agents.
| Without Mean-Field | With Mean-Field |
|---|---|
| High variance decisions | Stabilized outputs |
| Fragmented context | Population-aware decisions |
| Poor consistency | Convergent behavior |
This bridges micro decisions and macro patterns—something most agent systems conveniently ignore.
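A toy version of the feedback loop shows the stabilizing effect. The blending weights and update rule are illustrative assumptions, not the paper's equations:

```python
import numpy as np

def simulate_with_mean_field(signals: np.ndarray, steps: int = 20,
                             beta: float = 0.5, alpha: float = 0.3) -> np.ndarray:
    """Toy mean-field loop: each agent blends its noisy private signal
    with the shared aggregate mu_t, and mu_t tracks the population mean.
    Blending toward mu_t shrinks cross-agent variance while preserving
    the population average."""
    actions = signals.copy()
    mu = float(actions.mean())
    for _ in range(steps):
        actions = beta * signals + (1 - beta) * mu  # population-aware decisions
        mu = (1 - alpha) * mu + alpha * float(actions.mean())
    return actions
```

Individual actions contract toward the aggregate, so micro-level noise shrinks without distorting the macro-level mean, which is the bridge the text describes.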
4. Numerical alignment (the real differentiator)
The system explicitly trains LLMs to care about:
- Price
- Discounts
- Quantity
It does so with attention constraints baked into the training loss, e.g.:
$$ \mathcal{L}_{attn} = \mathbb{E}\big[\,\mathrm{KL}(A_i \,\|\, A_i^*)\,\big] $$
which pushes the model's attention distribution $A_i$ toward a reference $A_i^*$ concentrated on economically relevant features.
This is subtle but critical: it converts LLMs from language models into economic approximators.
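A minimal sketch of that loss term, assuming row-normalized attention maps of shape (batch, tokens); the function name and the numpy formulation are mine, not the paper's implementation:

```python
import numpy as np

def attention_alignment_loss(attn: np.ndarray, target: np.ndarray,
                             eps: float = 1e-12) -> float:
    """KL(A_i || A_i*) averaged over a batch: penalizes attention
    distributions that drift away from a reference `target` putting
    mass on economically relevant tokens (price, discount, quantity).
    Both inputs are (batch, tokens) with rows summing to 1."""
    a = np.clip(attn, eps, None)
    t = np.clip(target, eps, None)
    kl = np.sum(a * np.log(a / t), axis=-1)  # per-example KL divergence
    return float(kl.mean())
```

The loss is zero when attention already matches the reference and grows as the model treats price like decoration, so gradient descent has a direct reason to look at the numbers.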
Findings — What actually improves
The results are not just incremental—they reveal a structural shift in capability.
Core performance comparison
| Model | Hit Rate | Quantity Error | Stability | Time Cost |
|---|---|---|---|---|
| LLM Economist | 0.38 | 0.95 | 0.42 | Low |
| EconAgent | 0.57 | 0.97 | 0.33 | Low |
| FinCon | 0.80 | 1.32 | Low | Medium |
| MALLES (Base) | 0.70 | 0.94 | 0.38 | Low |
| MALLES (Enhanced) | 0.77 | 0.65 | 0.35 | High |
Key observations
- **Accuracy vs. stability trade-off resolved**: MALLES maintains strong hit rates while significantly reducing variance.
- **Numerical reasoning actually emerges**: Quantity prediction improves meaningfully, something most LLM systems fail at.
- **Cross-category training works**: Models trained on broader datasets generalize better in sparse settings.
Ablation insights
| Component Removed | Impact |
|---|---|
| Post-training | Collapse in price sensitivity |
| Multi-agent discussion | Reduced strategy diversity |
| Mean-field | Higher variance, unstable outputs |
In other words: each component is not optional—it’s structural.
Implications — What this means for business
This is where things get interesting (and slightly uncomfortable).
1. Simulation replaces intuition
Instead of guessing:
- “Will this discount work?”
- “How will customers react?”
You simulate it.
At scale.
Repeatedly.
2. Decision-making becomes testable
MALLES introduces a closed-loop system:
| Step | Function |
|---|---|
| Strategy input | Pricing / promotion / inventory |
| Simulation | Multi-agent evaluation |
| Feedback | Performance metrics |
| Adjustment | Iterative refinement |
This is effectively A/B testing for entire strategies, not just UI buttons.
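The closed loop in the table can be sketched as a search procedure. The function, the discount-perturbation heuristic, and the scalar-score interface are illustrative assumptions; `evaluate` stands in for the full multi-agent simulation:

```python
def closed_loop_optimize(evaluate, candidates, rounds: int = 3):
    """Closed-loop strategy search: propose strategies, simulate each
    (evaluate: strategy dict -> scalar performance metric), keep the
    best so far, then refine candidates around it."""
    best, best_score = None, float("-inf")
    for _ in range(rounds):
        for strategy in candidates:
            score = evaluate(strategy)  # simulated market feedback
            if score > best_score:
                best, best_score = strategy, score
        # iterative refinement: perturb the current best strategy
        candidates = [dict(best, discount=best["discount"] * f)
                      for f in (0.9, 1.0, 1.1)]
    return best, best_score
```

Swap the toy `evaluate` for the agent sandbox and the same loop A/B tests whole pricing strategies instead of UI buttons.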
3. AI agents become economic infrastructure
Not tools. Not assistants.
Infrastructure.
The same way databases replaced spreadsheets, these systems may replace:
- Market research surveys
- Demand forecasting heuristics
- Manual pricing strategies
4. Risks (because there are always risks)
- Behavioral manipulation: hyper-accurate preference modeling can be weaponized
- Data dependency: garbage transaction data → garbage economic agents
- Overfitting to past markets: structural shifts still break models
In short: powerful, but not neutral.
Conclusion — From language to logic
MALLES is not just another “multi-agent framework.”
It’s a signal that LLMs are evolving from:
systems that describe the world → systems that simulate it
The real breakthrough isn’t that agents can talk to each other.
It’s that their conversations now produce numerically grounded, economically consistent decisions.
And once that happens, the line between simulation and strategy becomes… negotiable.
Cognaptus: Automate the Present, Incubate the Future.