Bias in the Warehouse: What AIM-Bench Reveals About Agentic LLMs

TL;DR for operators

AIM-Bench is not another “which model is smartest?” leaderboard. It is a warehouse stress test for agentic LLMs asked to make replenishment decisions under uncertainty.¹

The useful lesson is uncomfortable: inventory agents can look mathematically fluent while still behaving like biased managers. Most evaluated models show mean anchoring in the newsvendor task. All evaluated models show bullwhip amplification in the Beer Game. Some models over-order to avoid stockouts; others keep leaner inventory but accept higher shortage risk. In other words, the operational personality of the model matters.

For deployment, the paper suggests three practical moves. First, evaluate inventory agents by scenario, not by generic reasoning score. Second, track process metrics, not only end-state KPIs. Third, treat mitigations such as cognitive-reflection prompting and information sharing as partial controls, not magic dust. They reduce some failures, and occasionally create new ones. Very on brand for enterprise AI, sadly.

The boundary is equally important. AIM-Bench uses simulated text-based environments, not live ERP, warehouse-management, supplier, or transport systems. The results are best read as a screening framework for decision risk, not as proof that a given model will save your inventory budget next quarter.

The familiar problem: a stockout is not a philosophy seminar

Inventory management is where elegant reasoning goes to be humbled by Tuesday.

A retailer orders too little and loses sales. It orders too much and traps cash in slow-moving stock. A logistics firm tries to balance service levels, lead-time noise, supplier delays, and the small matter of customers not behaving like Gaussian distributions because apparently they have free will.

This is exactly the kind of setting where agentic LLMs sound tempting. Give the model sales history, supplier context, inventory status, maybe a few operational rules, and let it recommend replenishment actions. The pitch is neat: language models can reason, plan, explain, and interact with tools. Surely they can help decide how many units to order.

AIM-Bench asks the right irritating question: when an LLM becomes an inventory manager, does it optimise like a decision model, or does it inherit the familiar biases of human operators?

The answer is not “LLMs are useless.” That would be too easy, and therefore suspicious. The answer is sharper: they are uneven operational decision-makers. Their errors are structured. Their biases can be measured. And the model that behaves best in one supply-chain setting may be the one causing chaos in another.

AIM-Bench tests uncertainty by category, not by vibes

The paper introduces AIM-Bench as a benchmark for inventory LLM agents across five environments:

Environment	Main operational setting	Main uncertainty exposed
Newsvendor Problem	One-shot ordering for a single selling period	Stochastic demand
Multi-period Replenishment	Repeated replenishment over time	Stochastic demand and variable lead times
Beer Game	Multi-echelon supply chain with sequential ordering	Partner behaviour and order amplification
Two-level Warehouse Network	Central warehouse plus downstream mini-warehouses	Routing, replenishment, and multi-source uncertainty
Supply Chain Network	Downstream actor with normal and expedited upstream sources	Source selection, lead-time trade-offs, and cost-risk balance

This matters because “inventory management” is not one task. A one-period order decision is not the same as managing repeated replenishment with variable lead times. A multi-echelon system is not just a bigger spreadsheet. It is a coordination problem where each player’s action becomes someone else’s signal.

The benchmark also separates outcome metrics from process metrics. Outcome metrics include average inventory cost, stockout rate, and turnover rate. These resemble the metrics operators already understand. Process metrics ask a more diagnostic question: how far were the model’s replenishment actions from an ex-post optimal quantity?

That second layer is where the paper gets more useful. A model can reach a similar stockout rate for the wrong reasons. It can pile up inventory, suppress shortages, and look “safe” until finance asks why working capital has been quietly stuffed into the warehouse. The process metric helps identify that pattern before it is disguised as operational competence.

The evidence is a set of failure modes, not a trophy ceremony

The paper evaluates five models: Qwen-2.5-72B, DeepSeek-V3, GPT-4o, GPT-4.1, and Gemini-2.5-flash-lite. But reading AIM-Bench as a leaderboard misses the point. The stronger reading is by failure category.

Paper component	Likely purpose	What it supports	What it does not prove
Newsvendor framing and anchoring tests	Main evidence for decision bias	LLM agents can anchor on mean demand and fail to adjust fully toward the optimal order quantity	That the same bias magnitude will appear in every product category or demand distribution
Demand chasing analysis	Main behavioural diagnostic	Models show less demand-chasing bias than typical human results reported in the behavioural inventory literature	That LLMs are generally bias-free in sequential demand settings
Cognitive-reflection prompting	Mitigation ablation	Prompting models to imitate slower, System-2-style reasoning can reduce anchoring in some cases	That reflection prompts create reliable optimisation behaviour
Beer Game and information sharing	Main evidence plus mitigation test	All evaluated models show bullwhip amplification; information sharing reduces it for some models	That full transparency always improves agentic supply-chain control
Multi-period distance-to-optimal metric	Evaluation design contribution	Process metrics can distinguish models that look similar under stockout or cost metrics	That ex-post optimal distance alone captures all operational objectives
Table of real-world-style metrics across the Beer Game (BG), multi-period replenishment (MPR), two-level warehouse network (TWN), and supply chain network (SCN)	Cross-environment comparison	Models have distinct operational profiles across supply-chain structures	That one model is universally best for deployment

That table is the core editorial frame. AIM-Bench is interesting because each test exposes a different operational weakness. The headline is not “model X wins.” The headline is “the same model can be prudent in one warehouse and reckless in another.” Truly, the future of automation is learning that procurement already had problems.

Failure mode one: mean anchoring makes the model conservative in the wrong way

In the newsvendor problem, the agent decides how much stock to order for one selling period. The economically optimal order depends on the cost of under-ordering versus over-ordering. If missing a sale is expensive relative to leftover stock, the optimal order should move above average demand. If leftover stock is the more expensive error, the optimal order should move below average demand.
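The trade-off above is the textbook critical-ratio rule, and a few lines make it concrete. This is a minimal sketch, not the benchmark's code: it assumes normally distributed demand, and the function name, prices, and demand parameters are illustrative placeholders rather than values from the paper.

```python
from statistics import NormalDist


def newsvendor_order(mean, std, unit_price, unit_cost, salvage=0.0):
    """Profit-maximising one-period order, assuming normal demand."""
    underage = unit_price - unit_cost  # margin lost per unit of unmet demand
    overage = unit_cost - salvage      # loss per unit left on the shelf
    critical_ratio = underage / (underage + overage)
    # Order up to the demand quantile given by the critical ratio.
    return NormalDist(mean, std).inv_cdf(critical_ratio)


# Hypothetical products with the same demand but different margins.
high_margin = newsvendor_order(mean=100, std=20, unit_price=10, unit_cost=2)
low_margin = newsvendor_order(mean=100, std=20, unit_price=10, unit_cost=8)
print(round(high_margin, 1), round(low_margin, 1))
```

With identical demand, the high-margin item should be ordered above the mean of 100 and the low-margin item below it. An agent that answers "about 100" for both has not read the economics, however fluent the explanation.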

Humans often show a pull-to-centre effect: they start from mean demand and insufficiently adjust toward the optimal quantity. AIM-Bench finds that most evaluated LLM agents show a similar pattern.

The reported high-margin newsvendor setting is telling. GPT-4o shows anchoring factors of 1 and 0.925. GPT-4.1 still shows anchoring, with a reported value of 0.405 in one frame. DeepSeek-V3 shows substantial anchoring in the negative frame, with a reported value of 1.375. Gemini-2.5-flash-lite is the outlier in this setting: on the anchoring metric the authors report, it shows complete immunity.
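To read numbers like these, it helps to have one possible definition in hand. The sketch below uses a convention common in the behavioural newsvendor literature, where the order is treated as a blend of mean demand and the optimal quantity. That convention is an assumption here: the paper's exact formula is not reproduced in this article, and the function name and quantities are hypothetical.

```python
def anchoring_factor(order, mean_demand, optimal_order):
    """Share of the gap between optimum and mean that the order fails to close.

    Assumed convention: order = a * mean_demand + (1 - a) * optimal_order.
    a = 0 means the order is optimal, a = 1 means the order sits on the mean,
    and a > 1 means the order moved past the mean, away from the optimum.
    """
    if optimal_order == mean_demand:
        raise ValueError("anchoring is undefined when the optimum equals the mean")
    return (optimal_order - order) / (optimal_order - mean_demand)


# Hypothetical high-margin item: mean demand 100, optimal order 120.
print(anchoring_factor(120, 100, 120))   # fully adjusted to the optimum
print(anchoring_factor(110, 100, 120))   # adjusted halfway
print(anchoring_factor(100, 100, 120))   # ordered the mean
print(anchoring_factor(92.5, 100, 120))  # ordered on the wrong side of the mean
```

Under this reading, a factor of 1 is an agent that simply orders the mean, and a factor above 1 is an agent moving in the wrong direction.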

The business interpretation is simple but not comforting. A model can understand the words “maximise expected profit” and still choose quantities pulled toward a psychologically convenient centre. It does not need a human body to inherit human-looking shortcuts. Text is apparently enough. Marvellous.

For operators, mean anchoring means the agent may under-react when economics says it should move aggressively. That can be especially damaging in high-margin items where stockout costs dominate, or in constrained supply situations where conservative ordering is not actually conservative at all. It is just a quieter failure.

Failure mode two: demand chasing is weaker, which is good but not a personality trait

The paper also tests demand chasing: the tendency to adjust current orders toward the previous demand realisation. In independent demand settings, yesterday’s demand should not automatically drive today’s order. Humans often chase the latest observation anyway, because the last number always feels more real than the distribution. This is why dashboards are dangerous in the hands of the over-caffeinated.
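One simple way to instrument demand chasing is to correlate each order with the demand observed one period earlier. The sketch below is an assumed diagnostic, not necessarily the statistic the paper uses, and the function names and series are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series cannot chase anything
    return cov / sqrt(var_x * var_y)


def demand_chasing_score(orders, demands):
    """Correlation between the order in period t and demand in period t-1."""
    if len(orders) != len(demands) or len(orders) < 3:
        raise ValueError("need two equal-length series with at least 3 periods")
    return pearson(orders[1:], demands[:-1])


# Hypothetical independent demand draws.
demands = [90, 120, 80, 110, 95, 105]
chaser_orders = [100] + demands[:-1]  # always re-orders yesterday's demand
steady_orders = [115] * 6             # sticks to one fixed quantity

print(demand_chasing_score(chaser_orders, demands))
print(demand_chasing_score(steady_orders, demands))
```

The pure chaser should score close to 1 and the steady policy 0. In independent-demand settings, a persistently high score is the warning sign.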

AIM-Bench reports that LLMs show significantly less demand-chasing bias than humans, and that the dominant response is often the unbiased one.

That is a real positive. It suggests that LLM agents may be less prone to one common behavioural inventory error. But the result should not be over-promoted into a general claim that LLMs are disciplined statistical decision-makers. The same paper shows anchoring elsewhere and bullwhip amplification in multi-echelon settings.

The better interpretation is narrower: demand chasing may not be the primary failure mode for these models under this setup. That is useful because it tells evaluators where to look next. A benchmark that only says “good” or “bad” is not a benchmark; it is a horoscope with numbers.

Failure mode three: framing does not reliably turn inventory agents into prospect-theory puppets

The paper tests whether positive and negative framing can shift model risk preferences in the newsvendor task. Prior work in LLM decision-making has found context-dependent risk patterns, including cases where models resemble human prospect-theory behaviour. AIM-Bench tries to see whether framing inventory decisions as profits versus losses changes the agent’s ordering behaviour.

The authors do not find evidence of risk reversal in this context.

This is more important than it looks. Many AI governance conversations assume behavioural tendencies are portable: if a model is risk-averse in one task, perhaps it will be risk-averse elsewhere; if it displays framing sensitivity in one experiment, perhaps we can prompt around it in production. AIM-Bench pushes back. Behavioural theories for LLMs need to be tested in the actual decision environment.

For business use, this means prompt-based behavioural assumptions should be validated before becoming policy. A procurement agent that resists framing in one simulated task may still respond to other wording, role instructions, or tool feedback in production. The paper does not prove framing is irrelevant. It shows that behavioural findings do not transplant easily from one task to another.

Failure mode four: bullwhip amplification is where the warehouse gets expensive

The Beer Game is the classic supply-chain lesson in how small demand changes become large upstream ordering swings. Retailer orders affect wholesaler orders; wholesaler orders affect distributor orders; distributor orders affect production. Noise travels upstream, gains weight, and eventually arrives as a very serious meeting.
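The amplification described above is usually quantified as a variance ratio: how much more volatile an echelon's orders are than the demand it serves. The sketch below uses that standard ratio as an assumption. The paper's exact bullwhip formula may differ, and the order series are invented for illustration.

```python
from statistics import pvariance


def bullwhip_ratio(orders, demand):
    """Variance of orders placed divided by variance of end-customer demand.

    A ratio above 1 means the echelon amplifies demand variability.
    """
    demand_variance = pvariance(demand)
    if demand_variance == 0:
        raise ValueError("demand has no variance, so the ratio is undefined")
    return pvariance(orders) / demand_variance


# Hypothetical Beer Game style trace: one step change in customer demand.
customer_demand = [4, 4, 8, 8, 8, 8, 8, 8]
retailer_orders = [4, 4, 12, 10, 8, 8, 8, 8]
factory_orders = [4, 4, 20, 16, 6, 4, 8, 8]

print(bullwhip_ratio(retailer_orders, customer_demand))
print(bullwhip_ratio(factory_orders, customer_demand))
```

In this toy trace the ratio grows as the signal moves upstream, which is the pattern to look for before letting an agent place upstream orders.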

AIM-Bench finds that all evaluated LLM agents exhibit the bullwhip effect due to demand overestimation. Gemini-2.5-flash-lite, which performs best on newsvendor anchoring, performs badly here: the paper reports bullwhip-effect values of 19.22 and 28.61. GPT-4.1 also shows substantial demand overestimation.

This is the best example of why model selection cannot rely on a single operational test. A model that looks clean in a single-period decision can amplify volatility in a multi-agent chain. The failure is not only about calculating an order quantity. It is about interpreting another actor’s behaviour under partial observability.

A human supply-chain manager might over-order because the downstream signal looks like demand growth. An LLM agent can do the same, except it will explain itself in neat paragraphs while creating the same inventory mess. The prose is better. The pallet count is not.

Process metrics catch failures that KPIs can hide

The paper’s multi-period replenishment setting introduces a distance metric between the model’s order and an ex-post optimal order quantity. This is not just a mathematical nicety. It solves a practical evaluation problem.

Two agents can produce similar stockout rates while following different policies. One may be genuinely close to optimal. Another may be over-ordering, hiding shortages by carrying too much inventory. If you only look at stockout rate, both look acceptable. If you look at process distance, the difference becomes visible.
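The point is easy to reproduce on a toy example. The sketch below is illustrative only: it assumes orders arrive immediately and unmet demand is lost, it measures distance as a sum of absolute gaps, and the ex-post reference orders are hypothetical. None of these choices are taken from the paper.

```python
def stockout_rate(orders, demands, start_inventory=0):
    """Share of periods in which demand exceeds available stock."""
    inventory = start_inventory
    stockouts = 0
    for order, demand in zip(orders, demands):
        inventory += order  # assumption: orders arrive immediately
        if demand > inventory:
            stockouts += 1
        inventory = max(inventory - demand, 0)  # assumption: unmet demand is lost
    return stockouts / len(demands)


def process_distance(orders, reference_orders):
    """Sum of absolute gaps between actual and ex-post reference orders."""
    return sum(abs(o - r) for o, r in zip(orders, reference_orders))


demands = [10, 12, 9, 11]
ex_post_reference = [10, 12, 9, 11]  # hypothetical hindsight-optimal orders

lean_agent = [10, 12, 9, 11]     # tracks the reference exactly
hoarding_agent = [40, 0, 0, 10]  # buys a large buffer up front

for name, orders in [("lean", lean_agent), ("hoarder", hoarding_agent)]:
    print(name, stockout_rate(orders, demands), process_distance(orders, ex_post_reference))
```

Both agents should show the same stockout rate, while only the distance figure exposes the hoarder's buffer.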

The paper gives a concrete example: GPT-4.1 and Qwen-2.5-72B have similar stockout rates in the multi-period setting, 0.250 and 0.256 respectively, but their distance metrics differ: 467 for GPT-4.1 versus 608 for Qwen-2.5-72B. That gap suggests GPT-4.1 is closer to the optimal replenishment path even when the outcome metric looks similar.

The paper's Table 1 reinforces the same idea. In multi-period replenishment, GPT-4.1 records an average cost of 332, a turnover rate of 2.38, and a stockout rate of 0.25. GPT-4o records a higher average cost of 1090, a turnover rate of 4.60, and a stockout rate of 0.46. DeepSeek-V3 shows a lower turnover rate of 1.69 but a higher stockout rate of 0.54.

No single number tells the story. Cost, turnover, stockout, and process distance describe different parts of the operating profile. That is exactly how real inventory systems behave. The agent’s “answer” is only one artefact; the policy trajectory is the thing you are actually deploying.

Mitigation works, but it behaves like a control knob, not a cure

AIM-Bench evaluates two mitigation strategies.

The first is cognitive-reflection prompting. The authors design a prompt intended to imitate slower System-2-style reasoning. This reduces anchoring in some cases. For Qwen-2.5-72B, the anchoring factor drops from 0.7 to 0.255 in the positive frame and from 0.74 to 0.245 in the negative frame.

That is promising. It also has a narrow meaning. The mitigation is prompt-dependent. It shows that reasoning style can change behaviour in a simulated task. It does not show that a reflection prompt will remain stable under live data feeds, exceptions, supplier messages, or users trying to “just override this once.”

The second mitigation is information sharing in the Beer Game. Instead of leaving agents with only partial observations, the authors expand the state space so agents can access partner information. This significantly reduces bullwhip amplification in some cases. For Qwen-2.5-72B, the bullwhip effect falls from 13.78 to 4.45 in one comparison and from 23.07 to 10.73 in another.

Again, useful. Again, not magic. GPT-4o shows action chasing under information sharing, with bullwhip close to zero. That may sound good until you realise it can indicate conformity rather than intelligent coordination. Too little amplification is not automatically optimal if the agent is merely copying partners and losing strategic independence.

The practical message is that mitigation should be evaluated as a control system. Reflection changes how the agent reasons. Information sharing changes what the agent sees. Both can reduce one bias while exposing another. This is not a reason to abandon them. It is a reason to instrument them properly.

What this means for supply-chain AI adoption

AIM-Bench points toward a more mature deployment pattern for LLM inventory agents.

The wrong pattern is: choose the strongest general-purpose model, connect it to the inventory database, ask for replenishment recommendations, and call the result “agentic optimisation.” That is not transformation. That is a compliance incident with a nice interface.

The better pattern is scenario-screened deployment:

Deployment question	AIM-Bench lesson	Practical control
Does the model under-react to economic incentives?	Mean anchoring appears in most evaluated models	Test newsvendor-style decisions under different margin and salvage assumptions
Does the model chase recent noise?	Demand chasing is weaker here, but still worth measuring	Track correlation between current orders and prior demand shocks
Does the model amplify partner signals?	All evaluated models show bullwhip amplification	Simulate multi-echelon settings before connecting upstream order recommendations
Does the model hide risk through inventory buffers?	Outcome metrics can conceal over-ordering	Use process-distance metrics alongside cost, turnover, and stockout
Does mitigation reduce one failure while creating another?	Reflection and information sharing help, but not uniformly	Treat prompts and observability settings as tested controls, not fixed slogans
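The table can be turned into a pre-deployment gate with very little machinery. The sketch below is one hedged way to do it. The class, field names, and threshold values are all hypothetical placeholders, and the limits in particular are not taken from the paper and would need to be set per business.

```python
from dataclasses import dataclass


@dataclass
class ScreeningResult:
    """Behavioural diagnostics for one model in one simulated scenario."""
    anchoring_factor: float
    demand_chasing: float
    bullwhip_ratio: float
    process_distance: float


# Hypothetical limits for illustration only; they are not from the paper.
LIMITS = {
    "anchoring_factor": 0.5,
    "demand_chasing": 0.3,
    "bullwhip_ratio": 2.0,
    "process_distance": 100.0,
}


def failed_checks(result, limits=LIMITS):
    """Names of diagnostics whose magnitude exceeds the configured limit."""
    return [
        name
        for name, limit in limits.items()
        if abs(getattr(result, name)) > limit
    ]


# Hypothetical candidate: fine on single-period tests, poor in the chain.
candidate = ScreeningResult(
    anchoring_factor=0.25,
    demand_chasing=0.05,
    bullwhip_ratio=4.5,
    process_distance=60.0,
)
print(failed_checks(candidate))
```

The design choice worth keeping is that the gate reports which check failed rather than a single pass or fail, because the remedy for anchoring is not the remedy for bullwhip.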

This is especially relevant for supply-chain software vendors. If an LLM agent is sold as an inventory assistant, the evaluation should not stop at response quality. It should include operational behaviour under stochastic demand, variable lead times, supplier uncertainty, and multi-agent coordination. The agent should be tested like a replenishment policy, not like a chatbot.

For retailers and logistics firms, the implication is procurement discipline. Ask vendors for scenario-level evidence. Ask how they measure bullwhip amplification. Ask whether they test distance from optimal replenishment paths. Ask what happens when information sharing increases conformity. Then enjoy the silence, which may be the most informative part of the demo.

What the paper shows, what Cognaptus infers, and what remains uncertain

The paper directly shows that AIM-Bench can expose human-like decision biases in LLM inventory agents across simulated supply-chain environments. It shows mean anchoring in most evaluated models, broad bullwhip amplification, weaker demand chasing than typical human behavioural results, and partial mitigation through cognitive-reflection prompting and information sharing.

Cognaptus infers that LLM inventory agents should be governed as decision-risk systems. The model is not just generating text; it is producing operational actions or recommendations. That means evaluation should include behavioural diagnostics, scenario stress tests, and process metrics before the agent is allowed near replenishment workflows with financial consequences.

What remains uncertain is production transfer. AIM-Bench uses text-based simulation environments. Real deployments include messy master data, supplier politics, promotional calendars, partial integrations, bad forecasts, late trucks, human overrides, and the timeless joy of someone uploading the wrong spreadsheet. The benchmark does not prove live ERP or WMS performance. It tells us what to test before pretending we know.

There is also a mitigation boundary. The paper explores prompt-dependent methods, not reinforcement learning or policy training. Future systems may reduce these biases through specialised training, constrained optimisation layers, or hybrid architectures that reserve final replenishment decisions for classical solvers. AIM-Bench does not close that design space. It gives it a measuring instrument.

The conclusion: do not hire an inventory agent without checking its habits

The lazy reading of AIM-Bench is that LLMs are biased. True, but incomplete.

The more useful reading is that LLM inventory agents have operational habits. Some anchor. Some over-order. Some amplify upstream volatility. Some look good under one metric and poor under another. Some improve when asked to reflect; some become too imitative when given partner information.

That is exactly the kind of behaviour businesses need to know before deployment. The question is not whether an LLM can explain inventory management. Many can. The question is whether it can make repeated decisions under uncertainty without turning small demand noise into expensive warehouse theatre.

AIM-Bench does not give operators a final answer. It gives them a better interview question: before this agent manages inventory, show me how it behaves when the supply chain stops being polite.

Sources

Cognaptus: Automate the Present, Incubate the Future.

¹ Xuhua Zhao, Yuxuan Xie, Caihua Chen, and Yuxiang Sun, “AIM-Bench: Evaluating Decision-making Biases of Agentic LLM as Inventory Manager,” arXiv:2508.11416, 2025. https://arxiv.org/abs/2508.11416
