A boardroom simulation is only useful if you know what was being simulated.
That sounds obvious. It is also where many AI-agent demos quietly fall apart. Give one hundred language-model agents a set of personas, drop them into a toy market, forum, election, auction, or customer-support queue, and the result will usually look interesting. Someone panics. Someone coordinates. Someone overpays. Someone posts something faintly unhinged. Excellent. We have recreated the internet.
The harder question is whether the behaviour came from the agent’s “psychology”, the market rules, the prompt template, the memory buffer, the tool access, the model backend, the communication topology, or the researcher’s well-meaning duct tape. In most LLM-based agent-based modelling, those things arrive bundled together in one bespoke implementation. The result is a simulation that may be vivid but is difficult to reproduce, compare, or diagnose.
That is the useful target of Shachi, a methodology and open-source framework for LLM-based agent-based modelling from Sakana AI and the University of Tokyo.[1] The paper is not mainly saying, “LLM agents are realistic now.” That would be the cheap version, and thankfully not the one worth reading. Its stronger claim is architectural: if we want LLM agents to support cumulative simulation science, we need to separate the agent from the world, then separate the agent’s cognition into testable modules.
That distinction matters for business. A simulation that merely produces plausible drama is a toy. A simulation where you can ask, “What changed when agents gained memory? What changed when they gained a news tool? What changed when their role configuration changed?” starts to become an experimental instrument.
Not an oracle. An instrument. Small difference. Large invoice.
Shachi treats the agent as a pod, not a prompt
The basic move in Shachi is to stop treating an LLM agent as one prompt-shaped blob.
The framework separates the agent’s internal policy from the environment’s dynamics. In ordinary terms: the agent decides; the world updates. Those two responsibilities are not allowed to leak into each other casually. The environment emits observations, collects actions, and advances the simulation. The agent receives an observation, deliberates through its architecture, may call tools within that step, and eventually outputs an action.
That separation is borrowed partly from reinforcement-learning-style interfaces such as Gym, but adapted for language agents. In Shachi, the environment has reset() and step()-style mechanics, while agents receive structured observations and return structured actions. The important business idea is not the API shape. It is accountability. If a simulated market crashes, you need to know whether the crash came from trader cognition or from the market-clearing rule. Otherwise you have not modelled a system. You have staged a puppet show with spreadsheets.
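To make the boundary concrete, here is a minimal sketch of a Gym-style loop in Python. Every name in it (`Observation`, `Action`, `CountingEnv`, `AlwaysBuy`, `run`) is invented for illustration and is not Shachi's actual API; the only thing borrowed from the text is the shape: the environment owns `reset()` and `step()`, the agent only maps an observation to an action.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass
class Observation:
    agent_id: int
    text: str


@dataclass
class Action:
    agent_id: int
    payload: str


class Agent(Protocol):
    """Anything with an act() method counts as an agent in this sketch."""

    def act(self, obs: Observation) -> Action: ...


class CountingEnv:
    """Toy world (hypothetical): it only counts how many 'buy' actions it has seen."""

    def __init__(self, n_agents: int) -> None:
        self.n_agents = n_agents
        self.clock = 0
        self.buys = 0

    def reset(self) -> list[Observation]:
        self.clock = 0
        self.buys = 0
        return self._observe()

    def step(self, actions: list[Action]) -> list[Observation]:
        # Only the environment mutates world state and advances time.
        self.buys += sum(a.payload == "buy" for a in actions)
        self.clock += 1
        return self._observe()

    def _observe(self) -> list[Observation]:
        text = f"t={self.clock} buys={self.buys}"
        return [Observation(i, text) for i in range(self.n_agents)]


class AlwaysBuy:
    """Trivial policy standing in for an LLM-backed agent."""

    def act(self, obs: Observation) -> Action:
        return Action(obs.agent_id, "buy")


def run(env: CountingEnv, agents: list[Agent], n_steps: int) -> CountingEnv:
    observations = env.reset()
    for _ in range(n_steps):
        # Agents decide; the world updates. Neither reaches into the other.
        actions = [agent.act(obs) for agent, obs in zip(agents, observations)]
        observations = env.step(actions)
    return env
```

Swapping `AlwaysBuy` for a different policy, or `CountingEnv` for a different clearing rule, changes exactly one side of the loop. That is the accountability the paragraph above is asking for.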
Inside the agent, Shachi decomposes cognition into four components:
| Component | What it represents | Operational interpretation |
|---|---|---|
| LLM | The reasoning engine that turns observations into responses | The base cognitive substrate: model choice, reasoning style, language behaviour |
| Configs | Intrinsic identity, role, constraints, policy tendencies | Persona, incentive, risk profile, institutional role, behavioural rules |
| Memory | Dynamic internal state and retained experience | History, learning, relationship continuity, prior exposure, institutional memory |
| Tools | External functions or services available to the agent | Search, calculators, market feeds, databases, platform actions, APIs |
This is why the title’s “pods over prompts” is not just wordplay. Shachi’s unit of design is not a clever instruction string. It is a pod of interacting components: model, identity, memory, and capability. Prompting may implement some of those components, but it is no longer the whole conceptual universe. Progress.
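One way to picture the pod is as a small record with four fields, where removing a component is a one-line change. The sketch below is an assumption-laden illustration in Python: `AgentPod`, `build_prompt`, and `echo_llm` are hypothetical names, the "LLM" is a stub that only counts prompt lines, and tools are merely advertised in the prompt rather than called.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace
from typing import Callable


@dataclass(frozen=True)
class AgentPod:
    """Hypothetical four-part agent: LLM, Configs, Memory, Tools."""

    llm: Callable[[str], str]                                  # backend: prompt in, text out
    config: dict[str, str] = field(default_factory=dict)       # persona, role, constraints
    memory: tuple[str, ...] = ()                               # retained experience
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)

    def build_prompt(self, observation: str) -> str:
        parts = [f"{key}: {value}" for key, value in sorted(self.config.items())]
        parts += [f"memory: {item}" for item in self.memory]
        # Tools are only listed here; this sketch never invokes them.
        parts += [f"tool available: {name}" for name in sorted(self.tools)]
        parts.append(f"observation: {observation}")
        return "\n".join(parts)

    def act(self, observation: str) -> str:
        return self.llm(self.build_prompt(observation))


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model call: reports how many prompt lines it received."""
    return f"lines={len(prompt.splitlines())}"


full = AgentPod(
    llm=echo_llm,
    config={"role": "retail trader"},
    memory=("sold too early last week",),
    tools={"news": lambda query: "no headlines"},
)

# Ablation is a field swap, not a prompt rewrite.
no_memory = replace(full, memory=())
```

The design choice worth noticing is that `no_memory` differs from `full` in exactly one field, so any behavioural difference between the two can be attributed to memory and nothing else.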
The paper also makes a clean distinction between actions and tool calls. An action changes the environment’s state. A tool call gives the agent information or computation inside the current decision step without advancing the global simulation clock. That matters because many messy agent systems confuse “thinking”, “looking something up”, “talking to someone”, and “acting on the world”. Shachi forces those categories apart.
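The action versus tool-call split is easy to state and easy to blur in code, so a small sketch may help. The `World` class and `decide` function below are hypothetical and not taken from Shachi; they only show the rule described above: a tool call reads information without moving the clock, an action changes state and advances it.

```python
class World:
    """Hypothetical one-asset market with a global simulation clock."""

    def __init__(self) -> None:
        self.clock = 0
        self.price = 100.0
        self.tool_calls = 0  # bookkeeping only, not part of the simulated state

    def price_tool(self) -> float:
        # Tool call: returns information, leaves price and clock untouched.
        self.tool_calls += 1
        return self.price

    def step(self, order: str) -> None:
        # Action: changes the environment's state and advances the clock.
        if order == "buy":
            self.price += 1.0
        elif order == "sell":
            self.price -= 1.0
        self.clock += 1


def decide(world: World) -> str:
    """Agent-side deliberation: may consult tools several times per decision."""
    first_quote = world.price_tool()
    second_quote = world.price_tool()  # checking twice costs no simulated time
    return "buy" if second_quote <= first_quote else "hold"
```

However many times `decide` looks something up, the world only moves when `step` is called with the resulting order.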
The same principle applies to inter-agent communication. Agents do not simply call each other. Messages are mediated by the environment, which can enforce static networks, dynamic communication graphs, broadcasts, targeted messages, or platform-like interactions. This makes communication a property of the simulated world, not an accidental side effect of implementation.
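As a rough illustration of environment-mediated messaging, the sketch below routes every message through a bus that enforces a static directed network. `MessageBus` and its methods are made-up names for this example, not Shachi's interface, and only the static-network and broadcast cases from the paragraph are covered.

```python
from __future__ import annotations

from collections import defaultdict


class MessageBus:
    """Hypothetical environment-side router over a static directed graph."""

    def __init__(self, edges: set[tuple[str, str]]) -> None:
        self.edges = edges  # allowed (sender, receiver) pairs
        self.inboxes: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def send(self, sender: str, receiver: str, text: str) -> bool:
        # Targeted message: delivered only if the topology allows it.
        if (sender, receiver) not in self.edges:
            return False
        self.inboxes[receiver].append((sender, text))
        return True

    def broadcast(self, sender: str, text: str) -> int:
        # Broadcast: reaches every neighbour the sender is allowed to address.
        receivers = sorted(r for s, r in self.edges if s == sender)
        for receiver in receivers:
            self.inboxes[receiver].append((sender, text))
        return len(receivers)


bus = MessageBus({("a", "b"), ("a", "c")})
delivered = bus.send("a", "b", "hi")
blocked = bus.send("b", "a", "yo")     # no edge from b to a
reached = bus.broadcast("a", "news")
```

Because agents never hold references to each other, changing who can talk to whom means changing `edges`, which is a property of the world rather than of any agent.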
That is the mechanism-first lesson: realism is not added by sprinkling more personas over a simulation. Realism, or at least analysable realism, comes from knowing which part of the agent-world system is responsible for which behaviour.
The benchmark is there to test portability, not to crown a champion
Shachi includes a ten-task benchmark arranged across three levels of social complexity.
Level I contains single-agent tasks such as PsychoBench, CoMPosT, CognitiveBiases, EmotionBench, and EmergentAnalogies. These are useful for probing individual traits, emotions, biases, or reasoning without the noise of social interaction.
Level II contains non-communicative multi-agent environments such as EconAgent, StockAgent, and AuctionArena. Here agents interact indirectly through the shared world: prices, auctions, macroeconomic indicators, and resource constraints.
Level III contains communicative multi-agent systems, including OASIS and Sotopia. These environments add richer direct communication through social media or role-play interaction, mediated by the environment.
This structure is not decorative taxonomy. It gives researchers a way to test whether an agent architecture that works in one kind of world travels to another. A minimal single-agent reasoning setup may be enough for analogical reasoning. It may be useless in a stock-trading world that expects memory and tools. Conversely, a heavyweight architecture may add little to a task that does not require it.
The paper’s reproduction experiment ports eight prior systems into Shachi and compares the reproduced outputs against the original systems. The headline result is simple: Shachi has lower mean absolute error than the baselines across all eight reproduced tasks.
| Task | Baseline MAE | Shachi MAE |
|---|---|---|
| PsychoBench | 1.96 | 0.80 |
| CoMPosT | 0.23 | 0.06 |
| CognitiveBiases | 0.24 | 0.04 |
| EmotionBench | 13.82 | 3.37 |
| EmergentAnalogies | 0.64 | 0.05 |
| StockAgent | 9.07 | 2.63 |
| AuctionArena | 10.49 | 2.22 |
| Sotopia | 3.17 | 0.95 |
This result should be read carefully. Lower MAE here does not mean Shachi agents are “better humans”, “better traders”, or “more realistic citizens”. It means the modular implementation reproduced prior task outputs more faithfully than simpler baselines. That is a comparison-with-prior-work result. Its purpose is to show that Shachi can absorb earlier bespoke systems without losing their quantitative behaviour.
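For readers who want the metric spelled out, the sketch below defines mean absolute error in Python and then computes a relative reduction for each row of the table. The relative reduction is derived here from the table's figures and is not a number reported in the source; it is only a scale-free way to compare rows whose MAE units differ. The `mae` helper and the toy vectors in the tests are illustrative assumptions.

```python
def mae(reproduced: list[float], original: list[float]) -> float:
    """Mean absolute error between reproduced and original task outputs."""
    if not reproduced or len(reproduced) != len(original):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(r - o) for r, o in zip(reproduced, original)) / len(reproduced)


# (baseline MAE, Shachi MAE) copied from the table above.
REPORTED = {
    "PsychoBench": (1.96, 0.80),
    "CoMPosT": (0.23, 0.06),
    "CognitiveBiases": (0.24, 0.04),
    "EmotionBench": (13.82, 3.37),
    "EmergentAnalogies": (0.64, 0.05),
    "StockAgent": (9.07, 2.63),
    "AuctionArena": (10.49, 2.22),
    "Sotopia": (3.17, 0.95),
}

# Fraction of the baseline error removed; 0 means no change, 1 means perfect match.
reduction = {
    task: 1 - shachi / baseline for task, (baseline, shachi) in REPORTED.items()
}

for task, value in sorted(reduction.items(), key=lambda item: item[1]):
    print(f"{task:<18} {value:.1%}")
```

Sorting by reduction puts the smallest improvement first, which is a quick way to see where the modular port is closest to the baseline.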
That is already useful. In a fragmented field, reproduction is not glamorous, but it is the part that prevents every paper from becoming a snowflake with a benchmark attached.
The cross-task generalisation experiment then asks a more diagnostic question: what happens when an agent architecture from one task is deployed in another? The paper fixes the backend model and compares architectures with different module combinations. A StockAgent-style agent, which includes config, memory, and tools, maintains stable performance across the tested tasks. An AuctionArena-style agent, which lacks tools, transfers less well to StockAgent, where tool access matters. A minimal EmergentAnalogies-style agent performs adequately on simple reasoning tasks but drops sharply in AuctionArena.
The point is not “more components always win”. The Sotopia result complicates that easy story: transferred agents performed close to the diagonal (the case where a task is run with its own native architecture) even when the expected memory advantage was not obvious. The authors suggest that memory may not have been heavily used in that setup, or that the model’s context window handled short-term coherence well enough. Good. A framework that exposes when a component does not matter is more valuable than one that insists every architectural bell must be rung.
The experiments separate main evidence from exploratory leverage
The paper’s experimental arc has two halves.
The first half is foundational validation: can Shachi reproduce prior systems and compare architectures across tasks? That is the main methodological evidence.
The second half uses Shachi to ask questions that would be awkward in ad-hoc systems: what happens when agents carry memory between lives, inhabit multiple environments, or receive layers of information during a real-world shock? These are not all equal in evidentiary weight. They are better read as demonstrations of what modular simulation makes possible.
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Eight-task reproduction | Comparison with prior work | Shachi can preserve prior task behaviours with lower MAE than baselines | That the reproduced behaviours are externally valid |
| Cross-task generalisation | Sensitivity and comparability test | Component availability affects transfer across environments | That maximal agent complexity is always best |
| Memory transfer to CognitiveBiases | Exploratory extension | Prior simulated experience can shift later bias profiles | That LLM memory shifts match human psychology |
| StockAgent + OASIS “multiple worlds” | Exploratory extension | Behaviour can propagate across economic and social environments | That social-media-market coupling is accurately calibrated |
| Tariff shock simulation | Cumulative ablation plus external-validity probe | Config, memory, and tools change market reaction in a direction aligned with observed events | That the system can forecast markets reliably |
This distinction is important because the paper is tempting to oversell. “Agents live in multiple worlds” is a fun phrase. “External validity through tariff shocks” is even more clickable. But the serious contribution remains the machinery that lets those experiments be staged cleanly.
The memory-transfer experiment is a good example. The authors transfer agents from OASIS and EconAgent into the CognitiveBiases task without clearing their memories. In OASIS, agents previously lived in a social-media-like world with short-term reactions and community pressure. In EconAgent, agents experienced repeated economic interactions and asset ownership. When those memories carry over, the measured bias profiles change. OASIS memory appears associated with higher hyperbolic discounting and in-group bias. EconAgent memory appears associated with a stronger endowment effect and lower loss aversion and survivorship bias.
That is fascinating, but it is not a behavioural-science revolution. It is an architectural demonstration: memory is not just storage; it is an intervention. If you let agents bring past simulated experience into a new evaluation, you can study how history changes decision tendencies.
For business, that maps directly onto a familiar problem. A customer-service agent trained or conditioned on refund disputes may behave differently when moved into retention. A procurement agent exposed to shortage scenarios may become more conservative in normal purchasing. A risk-review agent carrying past fraud cases may start seeing ghosts in legitimate applications. Sometimes that is desirable. Sometimes it is institutional trauma with an API.
The value is being able to test it before the system touches real workflows.
Multi-world agents show why aggregate outcomes can surprise the designer
The “living in multiple worlds” experiment connects StockAgent and OASIS. The same agents first participate in a stock-market simulation, then move into a social-media environment, carrying internal state across both. The OASIS topic concerns Amazon opening physical stores; the StockAgent world has a chemical stock (Stock A) and a tech stock (Stock B).
The intuitive expectation is simple: expose agents to Amazon-related social content, and perhaps they become more bullish on the tech-like stock. At the action level, that intuition partly appears. With OASIS present, volume increases by 10.0% for Stock A and 20.0% for Stock B. Buy orders rise for both stocks. Sell orders rise for Stock A but fall by 8.5% for Stock B.
So far, so tidy.
But the price movement is less tidy. The paper reports that with OASIS present, stock prices rise less than in the StockAgent-only setting. The authors note that this surprised them. The agent-level indicators showed more buying interest in the tech stock, but the system-level price movement did not translate into a simple surge.
That is exactly why simulation can be useful. It catches the point where local intuition fails at aggregate dynamics.
For a business reader, this is the difference between a focus group and a market. Individual agents can sound bullish, buy more often, sell less often, and still participate in a system where prices move differently because of matching rules, timing, liquidity, counterparty behaviour, or correlated actions. The environment matters. The distribution of actions matters. The mechanism connecting actions to outcomes matters.
This is also where bad agent demos become dangerous. A transcript of plausible individual reasoning can seduce executives into thinking the aggregate outcome is obvious. Shachi’s design encourages the opposite habit: inspect the components, run the environment, then compare agent-level and system-level signals. Reality, being inconsiderate, often hides in the gap.
The tariff shock is the most business-relevant experiment, but only directionally
The tariff-shock experiment is the paper’s most direct attempt at external validity. It uses StockAgent to simulate a five-day trading period, from April 1 to April 5, 2025, around a U.S. tariff shock. The authors run a cumulative ablation across four settings:
| Setting | Added information or capability | What the test isolates |
|---|---|---|
| Base | Standard StockAgent setup | Baseline trading behaviour |
| Base + Config | A pre-announcement tariff headline added to configuration | Awareness of the policy shock |
| Base + Config + Memory | A memory summary of academic research on tariffs and markets | Background economic knowledge |
| Base + Config + Memory + Tool | A daily news-retrieval tool with escalating trade-tension updates | Timely information access |
The result is not just “agents sell when tariffs appear”. The sequence is more interesting.
In the base setting, the buy-to-sell ratio is 0.99 for Stock A and 0.73 for Stock B. When agents receive the tariff headline through configuration, the ratios drop to 0.51 and 0.45 respectively. Awareness alone makes agents more bearish. When agents also receive memory containing academic knowledge about tariff effects, the ratios rise to 0.62 and 0.59. The paper interprets this as less reactive behaviour: economic background softens raw news panic. Finally, when agents gain daily news access, the ratios move to 0.44 for Stock A and 0.55 for Stock B. The preference flips: for the first time Stock B’s ratio exceeds Stock A’s, so Stock B becomes relatively more resilient.
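The sequence is easier to follow as arithmetic. The short Python sketch below takes the buy-to-sell ratios quoted above, computes the change introduced by each added component, and records which stock has the higher ratio in each setting. The variable names are illustrative; the numbers are the ones in the text, and nothing here is the paper's own analysis code.

```python
# (Stock A ratio, Stock B ratio) per cumulative setting, as quoted in the text.
ratios = {
    "Base": (0.99, 0.73),
    "Base + Config": (0.51, 0.45),
    "Base + Config + Memory": (0.62, 0.59),
    "Base + Config + Memory + Tool": (0.44, 0.55),
}

settings = list(ratios)

# Step-to-step change: what each newly added component did to each ratio.
deltas = []
for previous, current in zip(settings, settings[1:]):
    delta_a = round(ratios[current][0] - ratios[previous][0], 2)
    delta_b = round(ratios[current][1] - ratios[previous][1], 2)
    deltas.append((current, delta_a, delta_b))

# Which stock agents lean towards (higher buy-to-sell ratio) in each setting.
preferred = ["A" if a > b else "B" for a, b in ratios.values()]

for name, delta_a, delta_b in deltas:
    print(f"{name:<32} A {delta_a:+.2f}   B {delta_b:+.2f}")
print("preferred stock per setting:", preferred)
```

Read this way, the headline hits Stock A harder than Stock B, memory lifts both by a similar amount, and the news tool pulls Stock A down much further than Stock B, which is where the flip comes from.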
The authors compare the most complete simulation with real market data for manually verified counterparts to the two synthetic stock profiles. For the chemical-like Stock A counterparts, DOW, EMN, and LYB declined by 20.5%, 16.4%, and 19.4%. For the tech-like Stock B counterparts, PLTR, HOOD, and PATH declined by 8.1%, 16.0%, and 6.8%. Both groups fell, but the tech-like group generally fell less. The full simulation’s directional pattern aligns with that broad outcome.
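A quick check of the group-level claim, using only the six declines quoted above. The averages are computed here for illustration and are not figures reported in the source; with three tickers per group they are a summary, not a statistic to lean on.

```python
from statistics import mean

# Percentage declines over the period, as quoted in the text.
chemical_like = {"DOW": 20.5, "EMN": 16.4, "LYB": 19.4}   # Stock A counterparts
tech_like = {"PLTR": 8.1, "HOOD": 16.0, "PATH": 6.8}      # Stock B counterparts

mean_chemical = mean(chemical_like.values())
mean_tech = mean(tech_like.values())

# Smallest chemical-like decline versus largest tech-like decline.
closest_gap = min(chemical_like.values()) - max(tech_like.values())

print(f"chemical-like mean decline: {mean_chemical:.1f}%")
print(f"tech-like mean decline:     {mean_tech:.1f}%")
print(f"gap between closest pair:   {closest_gap:.1f} points")
```

The group means are well apart, but the closest pair (EMN and HOOD) differs by well under one percentage point, which is a good reason to treat the alignment as directional.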
That is meaningful. It is also narrower than “the model predicted the market”.
The simulation used a small set of manually matched real-world stocks. It ran five trials over five simulated days. It depended on StockAgent’s market mechanics, the chosen profiles, the specific news snippets, the memory summary, and the backend LLM behaviour. The comparison is supportive, not conclusive. It shows that adding the right cognitive components can move agent behaviour toward a real-world pattern. It does not establish Shachi as a trading system, a macro forecaster, or a Bloomberg terminal with social anxiety.
For business, the right interpretation is decision rehearsal. A Shachi-like setup could help teams test how different information architectures affect simulated organisational or market behaviour. What happens when field teams receive only the headline? What happens when they also have policy memory? What happens when they can query fresh data? Which agents overreact? Which agents wait? Which agents amplify the panic through communication channels?
That is not prediction. It is structured counterfactual practice.
The business value is diagnosis, not synthetic theatre
Many companies are already flirting with agent simulation. Some want synthetic customers. Some want market stress tests. Some want negotiation agents. Some want to simulate platform behaviour, supply-chain shocks, regulatory responses, or sales conversations at scale.
The common failure mode is treating the output as the product. If the transcript looks realistic, the simulation is declared useful. This is how one ends up with executive workshops powered by improv theatre and token spend.
Shachi points to a more disciplined approach. The product is not the transcript. The product is the ability to vary components and observe behavioural deltas.
For example:
| Business question | Shachi-style experimental lever | Useful output |
|---|---|---|
| Will customers panic after a policy change? | Config-only awareness vs memory-informed agents vs tool-enabled agents | Sensitivity of reactions to information quality |
| Do sales agents become too aggressive after past wins? | Memory transfer across customer segments | Behavioural drift from accumulated experience |
| Does social chatter affect market or demand signals? | Multi-world agents across forum and transaction environments | Gap between sentiment, action counts, and aggregate outcomes |
| Which agent capability is actually worth paying for? | Remove or add memory, tools, configs, or stronger backend LLMs | Component-level ROI diagnosis |
| Can a simulation reproduce a known historical event? | Calibrated environment plus staged information ablations | External-validity check before forward use |
The useful managerial move is to ask not “Are the agents realistic?” but “Which mechanism made the outcome change?” That question is less glamorous, which is how we know it is probably useful.
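In code, “which mechanism made the outcome change?” often reduces to a leave-one-out comparison. The Python sketch below is a toy: `toy_run` is a hypothetical stand-in for a full simulation, and its scores are arbitrary numbers chosen to show an interaction between components, not results from Shachi or any real system.

```python
from __future__ import annotations

from typing import Callable

COMPONENTS = frozenset({"config", "memory", "tools"})


def toy_run(enabled: frozenset[str]) -> float:
    """Hypothetical simulation metric. The weights are arbitrary, not measured."""
    score = 0.0
    if "config" in enabled:
        score += 1.0
    if "memory" in enabled:
        score += 0.5
    if {"memory", "tools"} <= enabled:
        score += 0.25  # in this toy, tools only help when memory is present
    return score


def leave_one_out(
    run: Callable[[frozenset[str]], float], components: frozenset[str]
) -> dict[str, float]:
    """Metric lost when each component is removed from the full agent."""
    full = run(components)
    return {name: full - run(components - {name}) for name in sorted(components)}


effects = leave_one_out(toy_run, COMPONENTS)
tools_alone = toy_run(frozenset({"tools"}))
```

In this toy, removing tools from the full agent costs something, yet tools on their own contribute nothing, so the same component looks valuable or worthless depending on what else is in the pod. A real component-level ROI analysis would need the same care.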
In a customer-operation setting, Config might encode customer type, tenure, contract terms, or risk profile. Memory might encode prior complaints, promises, or unresolved cases. Tools might expose CRM records, refund rules, stock availability, or escalation channels. The LLM is the reasoning layer. The environment controls queues, response times, inventory, policies, and other agents. With that structure, a firm can test whether a bad outcome comes from the agent’s role definition, missing memory, poor tools, or the workflow environment.
In market or policy simulation, the same logic applies. Config captures agent type: retail investor, institutional trader, supplier, regulator, distributor. Memory captures prior shocks and learned beliefs. Tools expose news, price history, filings, logistics feeds, or policy documents. The environment defines trading rules, budget constraints, communication networks, settlement timing, and physical bottlenecks. The model’s output becomes one part of a larger causal machine.
That is the real business upgrade: from “ask a bot what might happen” to “run structured behavioural scenarios where the knobs are visible”.
The framework also exposes uncomfortable model risk
Shachi’s modularity does not eliminate model risk. It makes some of it easier to see.
The paper’s appendix includes an EconAgent backend comparison where different LLMs produce different macroeconomic details. The agents collectively show regularities consistent with the Phillips Curve and Okun’s Law, but the curves and indicators differ by backend. GPT-4.1 Nano yields higher unemployment, while GPT-4.1 produces stronger GDP growth in their setup.
This is not a footnote for researchers only. It is a warning label for anyone using LLM agents in business simulation. The “same” agent architecture can produce different emergent outcomes when the backend model changes. Upgrade the model, and your simulated economy may quietly change its temperament. Very convenient. Very normal. Definitely something to govern.
The environment is another source of risk. The authors explicitly note that the fidelity of an agent-based model depends not only on the agent architecture but also on the world it inhabits. In StockAgent, market-clearing rules matter. In OASIS, recommendation mechanics and network topology matter. In a supply-chain simulation, lead-time assumptions and substitution rules would matter. In a compliance simulation, escalation procedures and evidence visibility would matter.
This is where Shachi is useful but not magical. It helps isolate agent-side components. It does not certify that the environment is correct. A beautifully modular agent in a badly specified world is still a tourist in a cardboard city.
For serious deployment, three layers need separate validation:
- Agent cognition: Are configs, memory, tools, and model behaviour appropriate for the role?
- Environment mechanics: Do the simulated rules match the operational system being studied?
- External alignment: Do known historical or observed scenarios produce directionally credible outcomes?
The tariff-shock experiment is a first example of the third layer. It is not enough for high-stakes use, but it demonstrates the right validation instinct: compare simulated outcomes with a known real-world event, then inspect which cognitive components were necessary to get there.
What Cognaptus would take from this paper
The paper’s main lesson for enterprise AI is not that LLM agents can now simulate society. Please do not put that sentence in a procurement deck.
The more useful lesson is that agent-based simulation becomes operationally interesting only when agent cognition is decomposed into auditable parts. Shachi gives the field a vocabulary for doing that: LLM, Configs, Memory, Tools, and an environment interface that keeps action, communication, and state transition separate.
That vocabulary helps shift the conversation from spectacle to diagnosis.
A weak simulation says: “Here is what 500 agents did.”
A better simulation says: “Here is what changed when we added memory.”
A still better one says: “Here is what changed when we added memory, held the backend fixed, altered tool access, preserved the environment mechanics, and compared the result with a known historical episode.”
That is the path from ad-hoc agent theatre to decision infrastructure.
The paper is still early. The benchmark covers important prior systems, but not every domain. The external-validity claim rests on one tariff-shock study. The memory-transfer and multi-world experiments are exploratory. The simulations depend on environment design, model choice, prompting, tool data, and parsing reliability. None of that makes the work weak. It makes the boundaries legible.
And legible boundaries are exactly what most AI-agent systems currently lack.
The bottom line: simulation needs levers, not vibes
The tempting view of LLM-based agent modelling is that we can finally make artificial societies by writing better personas and scaling the crowd. Shachi’s answer is more sober and more useful: build agents as modular cognitive systems, expose the environment boundary, then run experiments where the causal levers are visible.
For business, that means the near-term value is not autonomous forecasting. It is cheaper, safer rehearsal. Test information shocks. Test memory effects. Test tool access. Test cross-domain contamination. Test whether a synthetic market, platform, or organisation reacts differently when agents know more, remember more, or can do more.
The output will still need scepticism. It will still need calibration. It will still need domain experts who know when the simulated world has become too clean, too weird, or too flattering.
But compared with a swarm of prompt puppets, Shachi is a serious step toward agent simulation that can be inspected rather than merely admired.
And in enterprise AI, “inspectable” is often where “useful” finally begins.
Cognaptus: Automate the Present, Incubate the Future.
[1] So Kuroki, Yingtao Tian, Kou Misaki, Takashi Ikegami, Takuya Akiba, and Yujin Tang, “Reimagining Agent-based Modeling with Large Language Model Agents via Shachi,” arXiv:2509.21862, 2025, https://arxiv.org/abs/2509.21862.