Plan, Act, Replan: When LLM Agents Run the Aisles

Retail planning usually fails in the hand-off.

A sales team sets a target. Inventory planners translate it into stock positions. Procurement checks supplier feasibility. Operations discovers warehouse constraints. Someone exports a spreadsheet, someone else reworks the assumptions, and by the time the plan looks executable, the market has already wandered off with the innocence of a cat near an open laptop.

That is the ordinary pathology behind the paper Rethinking Supply Chain Planning: A Generative Paradigm.¹ Its most useful claim is not that large language models can “do supply chain planning,” which would be a fine way to create expensive confusion at scale. The paper’s stronger argument is that planning itself should be rebuilt as a generative loop: interpret business intent, decompose it into executable tasks, retrieve the right data, invoke analytical tools, monitor deviations, and replan while the cycle is still alive.

In other words, the LLM is not being asked to replace operations research, forecasting systems, or experienced planners. Good. That would be adorable, in the way a toddler holding a forklift key is adorable. The architecture is closer to cognitive middleware: a semantic orchestration layer that sits between human intent and the structured machinery of enterprise planning.

The paper’s real object is the planning loop

The paper begins by redefining supply chain planning as “interactive, integrated, and automated.” That phrasing is a little grand, but the underlying diagnosis is concrete.

Traditional planning assumes that the hard part is computing an optimal answer once the problem is properly specified. In stable environments, that is often reasonable. In e-commerce, the specification itself keeps changing. Demand shifts during promotions. Stock positions vary by warehouse. Procurement constraints collide with category targets. A top-down revenue goal must become SKU-level actions; SKU-level failures must then travel back upward quickly enough to matter.

The authors frame this as a shift from static optimisation to continuous orchestration. Planning is no longer just a calculation over known inputs. It becomes a system for keeping intent, data, constraints, and execution mutually aligned.

A simplified version of the proposed mechanism looks like this:

Stage	What the agentic system does	Operational purpose
Query enhancement	Converts a natural-language request into structured planning parameters	Reduces ambiguity before execution starts
Intent classification	Routes the request into a planning, diagnostic, monitoring, or recommendation intent	Prevents prompt roulette
Task orchestration	Breaks the intent into executable subtasks	Turns business language into workflow
Data acquisition	Uses text-to-SQL and slot matching to retrieve relevant operational data	Connects planning to enterprise systems
Data analysis	Generates auditable data operations or invokes validated analytical tools	Combines flexibility with control
Iterative refinement	Updates the remaining task list after intermediate observations	Keeps the plan responsive
Plan correction	Monitors execution, diagnoses deviations, and recommends updates	Closes the loop before the month-end apology tour

That sequence matters because it moves GenAI away from the role of “clever answer generator” and into the role of “workflow interpreter.” The difference is not cosmetic. A clever answer generator gives you a plausible paragraph. A workflow interpreter knows which database to query, which solver to call, which assumption has become stale, and which task should be dropped because new evidence made it irrelevant.
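The "drop a task because new evidence made it irrelevant" behaviour can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Task` class, the `replan` rule, and the task names are all hypothetical, and the lambdas stand in for real data loads and analyses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    name: str
    run: Callable[[dict], dict]  # receives observations so far, returns new ones


def replan(remaining: list, observations: dict) -> list:
    """Hypothetical rule: skip the promotion analysis if no promotion is active."""
    if observations.get("promotion_active") is False:
        return [t for t in remaining if t.name != "analyse_promotion_uplift"]
    return remaining


def execute(tasks: list) -> tuple:
    """Run tasks in order, revising the remaining list after every observation."""
    observations: dict = {}
    executed: list = []
    remaining = list(tasks)
    while remaining:
        task = remaining.pop(0)
        observations.update(task.run(observations))
        executed.append(task.name)
        remaining = replan(remaining, observations)
    return observations, executed


# Toy task list; the values are invented for illustration only.
tasks = [
    Task("load_sales_history",
         lambda obs: {"weekly_units": [120, 135, 128], "promotion_active": False}),
    Task("analyse_promotion_uplift", lambda obs: {"uplift": 0.0}),
    Task("draft_plan",
         lambda obs: {"planned_units": round(sum(obs["weekly_units"]) / len(obs["weekly_units"]))}),
]

observations, executed = execute(tasks)
print(executed)  # the promotion analysis never runs
```

The point of the sketch is the position of `replan`: it runs after every task, so the first decomposition is a draft rather than a contract.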

This is not LLM-as-forecaster

A likely misreading of the paper is that JD.com has replaced planners with an LLM planner. That is not what is described.

The system keeps classical machinery in the loop. The LLM-based agents handle semantic interpretation, decomposition, routing, code generation, and orchestration. When mathematical precision is needed, the framework shifts toward function calling: pre-validated forecasting engines, inventory optimisation modules, and other deterministic analytical tools can be invoked rather than regenerated in prose. This is the right division of labour. Let the model translate and coordinate. Let the tested tools calculate.
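A rough sketch of that division of labour, under stated assumptions: the model emits a structured call, and a dispatcher runs a tested function from an allow-list. The tool names, the JSON shape, and the two toy functions below are hypothetical stand-ins, not the engines the paper describes.

```python
import json
from statistics import mean


def moving_average_forecast(history: list, window: int = 3) -> float:
    """Stand-in for a pre-validated forecasting engine."""
    return mean(history[-window:])


def reorder_quantity(forecast: float, on_hand: float, safety_stock: float) -> float:
    """Stand-in for an inventory module: order up to forecast plus safety stock."""
    return max(0.0, forecast + safety_stock - on_hand)


# Only functions registered here can be invoked, whatever the model writes.
TOOLS = {
    "moving_average_forecast": moving_average_forecast,
    "reorder_quantity": reorder_quantity,
}


def dispatch(call_json: str) -> float:
    """Parse a model-produced call and run the matching validated tool."""
    call = json.loads(call_json)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**call["arguments"])


# Pretend the model produced this string; the arithmetic happens in tested code.
model_output = (
    '{"name": "moving_average_forecast", '
    '"arguments": {"history": [100, 110, 120, 130], "window": 3}}'
)
forecast = dispatch(model_output)
order = dispatch(json.dumps({
    "name": "reorder_quantity",
    "arguments": {"forecast": forecast, "on_hand": 90, "safety_stock": 20},
}))
print(forecast, order)
```

The model chooses which tool to call and with what arguments. It never gets to improvise the arithmetic.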

The paper’s Task Execution Agent is especially revealing here. It does not simply ask the model to write a heroic block of Python and hope the warehouse survives. The authors structure generated analysis around four atomic operations: Filter, Transform, Groupby, and Sort. Each step has a narrower scope, clearer lineage, and a more inspectable output.

That design choice is less glamorous than saying “autonomous supply-chain intelligence,” but it is far more important. In enterprise planning, auditability is not a decorative feature. If a replenishment recommendation is wrong, someone needs to know whether the error came from stale data, bad slot extraction, incorrect aggregation, an unsuitable forecasting function, or a business rule that no longer applies. Atomic operations make that diagnosis possible.
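To make the auditability argument concrete, here is a small sketch of the four operation types chained with a lineage log. It assumes plain Python rows instead of the dataframe tooling the paper's appendix mentions, and every function name, column, and data value is hypothetical.

```python
from collections import defaultdict


def filter_rows(rows, predicate):
    return [r for r in rows if predicate(r)]


def transform(rows, column, fn):
    # Adds or overwrites one column, leaving the input rows untouched.
    return [{**r, column: fn(r)} for r in rows]


def groupby_sum(rows, key, value):
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += r[value]
    return [{key: k, value: v} for k, v in totals.items()]


def sort_rows(rows, key, descending=False):
    return sorted(rows, key=lambda r: r[key], reverse=descending)


def run_pipeline(rows, steps):
    """Apply each labelled step and record the row count it produced."""
    lineage = []
    for label, op in steps:
        rows = op(rows)
        lineage.append((label, len(rows)))
    return rows, lineage


# Invented sample data.
sales = [
    {"sku": "A", "warehouse": "BJ", "units": 10, "price": 5.0},
    {"sku": "B", "warehouse": "BJ", "units": 5, "price": 20.0},
    {"sku": "A", "warehouse": "SH", "units": 6, "price": 5.0},
    {"sku": "C", "warehouse": "SH", "units": 0, "price": 9.0},
]

steps = [
    ("filter: units > 0",
     lambda rows: filter_rows(rows, lambda r: r["units"] > 0)),
    ("transform: revenue = units * price",
     lambda rows: transform(rows, "revenue", lambda r: r["units"] * r["price"])),
    ("groupby: sum revenue by sku",
     lambda rows: groupby_sum(rows, "sku", "revenue")),
    ("sort: revenue descending",
     lambda rows: sort_rows(rows, "revenue", descending=True)),
]

result, lineage = run_pipeline(sales, steps)
for label, row_count in lineage:
    print(f"{label} -> {row_count} rows")
```

If a recommendation looks wrong, the lineage shows which step changed the shape of the data, which is a far shorter search than reading one long generated script.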

The paper also uses retrieval-augmented generation and fine-tuning to ground the agents in domain-specific knowledge. Standard operating procedures, historical decisions, validated outputs, and planning protocols become part of the agent’s operating context. This is where the system starts to resemble an organisational interface rather than a chatbot with a supply-chain vocabulary pack.

The mechanism is vertical and horizontal at the same time

The authors describe planning as needing both vertical coherence and horizontal synergy. Put less ceremonially: the system must translate strategy downward and reconcile departments sideways.

Vertical coherence is the path from senior commercial intent to operational action. A quarterly or monthly target must become category plans, SKU-level priorities, warehouse allocations, replenishment schedules, and execution parameters. Manual organisations usually perform that translation through meetings, spreadsheets, and ritualised suffering.

Horizontal synergy is the coordination across functions with different incentives. Sales wants availability and revenue. Procurement wants feasible ordering and supplier stability. Inventory teams want turnover and controlled working capital. Operations wants capacity realism. Planning is where these tensions become either an executable compromise or a slow-motion mess.

The agent framework tries to handle both directions through a shared semantic environment. A business user can ask for a sales plan. The system retrieves the relevant SOP, classifies the intent, builds a task list, pulls historical sales and traffic data, analyses patterns, generates planning outputs, and then updates the task sequence as observations arrive.

The appendix example is useful as an implementation detail rather than as main evidence. It shows the workflow for a request such as generating a November sales plan for a computer department: retrieve SOPs, classify intent, create subtasks, generate SQL, load data into a dataframe, run pandas-based analysis, and then update the task list based on observations. The example is not proof that the system improves business results. It is a worked illustration of how the architecture turns a vague request into a runnable chain.

That distinction matters. Architecture examples explain feasibility; deployment metrics test usefulness. Mixing those two is how AI case studies become fog machines.

The evidence is a field deployment, not a lab benchmark

The paper’s main evidence comes from deployment inside JD.com’s supply-chain network. The setting is not toy-sized. The authors describe JD.com as managing around 10 million self-operated SKUs across a large warehouse and distribution network. For the empirical evaluation, they focus on more than 70,000 SKUs across three business units: Grain & Oils, Maternal & Infant Care, and Small Home Appliances.

The deployment began in December 2023 and was evaluated using a year-over-year comparison between May–June 2024, when the agent-driven system was used, and May–June 2023, the manual baseline. This period includes JD.com’s “618” Shopping Festival, a high-volatility retail event with distinct stock-up and sales-peak phases.

The results are commercially meaningful, but they should be read with adult supervision.

Paper element	Likely purpose	What it supports	What it does not prove
Agent architecture and module descriptions	Implementation detail	The framework is more than a single prompt; it includes routing, retrieval, code generation, tool invocation, and feedback	That every module contributes independently to the measured gains
Appendix workflow example	Implementation detail	A natural-language request can be converted into SOP-grounded tasks, SQL, analysis, and replanning	General performance across all planning scenarios
JD.com field deployment	Main evidence	The system can operate in a large, real-world e-commerce planning environment	Causal attribution equivalent to a randomized controlled trial
May–June 2024 vs May–June 2023 comparison	Main evidence with historical baseline	Agent-driven planning coincided with improved planning and availability metrics during a comparable retail period	That no external changes affected the year-over-year comparison
Atomic data-operation design	Implementation and governance mechanism	The system is built for traceability and debugging	That generated code is always correct or safe
Plan correction loop	Mechanism for resilience	The system can monitor deviations and recommend corrective action	Fully autonomous closed-loop control without human oversight

There are three headline metrics.

First, the paper reports a major reduction in planner workload. The deployment section breaks a manual weekly process into 20 minutes of data acquisition, 40 minutes of processing, and 60 minutes of analysis. In the agent-supported workflow, the user spends about 5 minutes describing the data requirement, while processing is automated. The paper’s summary language also refers to roughly 40% lower weekly data processing time. The safest interpretation is that routine data preparation and processing are heavily compressed, while planner attention shifts toward interactive analysis and strategic judgement.

Second, planning quality improves. The paper defines accuracy using deviation from end-of-week inventory values and reports a 22% increase in the proportion of plans with deviation below 5%. That is not the same as saying every forecast became 22% more accurate. It means more plans landed inside a tight deviation band. For managers, that is still important: planning quality is often less about one heroic forecast and more about raising the floor of routine decisions.
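The shape of that metric is easy to show. The sketch below assumes deviation means relative deviation, |planned - actual| / actual, which is one plausible reading and not a formula quoted from the paper. The function name and the sample values are invented.

```python
def share_within_band(planned, actual, band=0.05):
    """Share of plans whose relative deviation from actual is below `band`.

    Assumes every actual value is non-zero.
    """
    if len(planned) != len(actual) or not planned:
        raise ValueError("planned and actual must be non-empty and equal length")
    within = sum(1 for p, a in zip(planned, actual) if abs(p - a) / a < band)
    return within / len(planned)


# Hypothetical end-of-week inventory plans versus realised values.
planned = [100, 200, 300, 400]
actual = [102, 230, 298, 400]

share = share_within_band(planned, actual)
print(f"{share:.0%} of plans landed inside the 5% band")
```

The output is a share of plans, not an average error. One badly wrong plan and one nearly wrong plan count the same: both are simply outside the band.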

Third, stock fulfillment improves during peak volatility. At the SKU-distribution-centre level, the system maintained a stock fulfillment rate 2% above the historical manual baseline. The authors connect this to lower decision latency: rather than waiting for periodic batch review, the system can convert demand signals into execution updates faster. They estimate that the resulting agility produced around RMB 2 million in GMV uplift during the festival.

These are not vanity metrics. They map to labour productivity, stock availability, revenue capture, and planning consistency. But the evidence is still a field comparison inside one sophisticated platform. JD.com has scale, data infrastructure, operational discipline, and enough repeat planning volume to make orchestration valuable. A retailer with fragmented master data and undocumented SOPs should not expect the same lift by sprinkling agents over a swamp. Swamps remain rude.

The business value is shorter decision latency

The tempting interpretation is that the system improves planning because the model is smarter. The better interpretation is that the system shortens the distance between signal and correction.

Manual planning introduces latency at several points: translating intent, gathering data, cleaning it, deciding which analysis to run, interpreting results, routing findings across departments, and updating the plan. Each delay is survivable alone. Together, they create a planning cycle that is slower than the business it is meant to steer.

The agentic framework attacks those delays by making the workflow executable from the start. Natural language becomes structured slots. Slots become database queries and function calls. Intermediate outputs become observations. Observations change the next task. Deviations trigger diagnosis and correction.
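The "slots become database queries" step can be sketched with Python's built-in `sqlite3` module. Everything specific here is an assumption for illustration: the `Slots` fields, the `sales` table, its columns, and the sample rows. The slot extraction itself (natural language to `Slots`) is taken as already done by the model.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Slots:
    department: str
    month: str   # "YYYY-MM"
    metric: str  # must be a key of ALLOWED_METRICS


# Column expressions cannot be bound as SQL parameters, so they come from an
# allow-list rather than from model output.
ALLOWED_METRICS = {"units": "units", "revenue": "units * unit_price"}


def build_query(slots: Slots):
    """Turn extracted slots into a parameterised query plus its bound values."""
    if slots.metric not in ALLOWED_METRICS:
        raise ValueError(f"Unsupported metric: {slots.metric}")
    expr = ALLOWED_METRICS[slots.metric]
    sql = (
        f"SELECT sku, SUM({expr}) AS total FROM sales "
        "WHERE department = ? AND substr(sold_on, 1, 7) = ? "
        "GROUP BY sku ORDER BY total DESC"
    )
    return sql, (slots.department, slots.month)


# In-memory table with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sku TEXT, department TEXT, sold_on TEXT, "
    "units INTEGER, unit_price REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        ("LAPTOP-1", "computers", "2023-11-03", 3, 900.0),
        ("LAPTOP-1", "computers", "2023-11-18", 2, 900.0),
        ("MOUSE-7", "computers", "2023-11-05", 10, 25.0),
        ("MOUSE-7", "computers", "2023-10-29", 4, 25.0),
        ("RICE-2", "grain_oils", "2023-11-07", 50, 8.0),
    ],
)

slots = Slots(department="computers", month="2023-11", metric="units")
sql, params = build_query(slots)
rows = conn.execute(sql, params).fetchall()
print(rows)
```

User-supplied values travel as bound parameters, and the only free-form SQL fragment comes from a fixed dictionary. That is a design choice worth keeping even when the query text is generated by a model instead of a template.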

This is why the plan-act-replan structure is more important than any single model component. In volatile retail, the first plan is often wrong by Tuesday. The operational question is whether the organisation can see that, explain why, and adjust before the error compounds into stockouts, overstock, or lost sales.

From a business perspective, the framework is best understood as a planning operating system:

It gives business users a natural-language interface to planning workflows.
It preserves specialist tools instead of pretending LLMs are optimisation engines.
It codifies veteran planner logic into SOP-grounded routines.
It creates traceable analytical steps for review and debugging.
It keeps task execution adaptive rather than fixed at the first decomposition.

That combination is where the value sits. Not in “AI makes a plan.” In “AI keeps the planning machinery moving while conditions change.”

What companies should copy

The most transferable lesson is not the exact JD.com architecture. It is the discipline of separating semantic work from analytical work.

Companies should copy the pattern:

Business requirement	Practical design response
Users ask vague planning questions	Add query enhancement and slot extraction before execution
Different teams use different planning logic	Route requests by intent and domain
Existing models already work	Wrap them as callable tools rather than replacing them
Generated code is risky	Restrict it to auditable atomic operations where possible
SOPs live in documents and expert memory	Build a retrieval layer over SOPs, rules, and historical decisions
Plans go stale mid-cycle	Add diagnosis and correction as first-class workflow stages
Leaders need accountability	Preserve intermediate task lists, code steps, data sources, and outputs

The first move is not buying a bigger model. The first move is codifying the planning grammar of the business. What are the recurring intents? Which metrics define success? Which data sources are authoritative? Which calculations must use validated functions? Which decisions require human approval? Which deviations trigger replanning?
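One way to make those answers concrete is to write them down as data before any model is involved. The sketch below is a hypothetical "planning grammar": the intents, data source names, tool names, and thresholds are all invented, and the decision rule is deliberately simple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntentRule:
    intent: str
    data_source: str           # the authoritative source for this intent
    validated_tool: str        # the calculation that must not be regenerated
    needs_human_approval: bool
    replan_threshold: float    # relative deviation that triggers replanning


# Hypothetical entries; a real grammar would come from the business's own SOPs.
GRAMMAR = {
    "sales_plan": IntentRule(
        "sales_plan", "warehouse.sales_daily", "forecast_engine", True, 0.05),
    "stock_diagnosis": IntentRule(
        "stock_diagnosis", "warehouse.inventory_snapshot", "stock_checker", False, 0.10),
}


def next_action(intent: str, planned: float, actual: float) -> str:
    """Decide whether to continue, replan, or escalate, based on the grammar."""
    rule = GRAMMAR.get(intent)
    if rule is None:
        return "reject_unknown_intent"
    deviation = abs(actual - planned) / planned
    if deviation <= rule.replan_threshold:
        return "continue"
    return "replan_pending_approval" if rule.needs_human_approval else "replan"


print(next_action("sales_plan", planned=1000, actual=1080))
print(next_action("stock_diagnosis", planned=1000, actual=1080))
```

The same 8% miss leads to different actions for different intents, and that difference lives in a reviewable table rather than in a prompt.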

Without those answers, an agent has nothing stable to orchestrate. It can still produce fluent output, naturally. Fluency is cheap. Operational reliability is where the invoice starts.

Where the result has boundaries

The paper is strongest as a mechanism-plus-deployment study. It is weaker as a universal performance guarantee.

The year-over-year comparison is useful, especially because it occurs during a demanding retail period, but it is not a randomized trial. Market conditions, assortment changes, operational improvements, supplier behaviour, and platform-level process changes could also differ between May–June 2023 and May–June 2024. The paper reports meaningful operational improvements, but it does not isolate every causal pathway.

The architecture also assumes substantial enterprise readiness. It needs high-quality data access, stable identifiers, SOP documentation, domain-specific retrieval, tool APIs, monitoring, and governance over generated code. These are not footnotes. They are the difference between an agentic planning system and a confident intern with database permissions.

There is also a human-control boundary. The system assists planners by automating acquisition, processing, analysis, diagnosis, and recommendation. The paper does not establish that firms should remove human judgement from planning. Quite the opposite: its strongest design logic is that humans provide strategic intent and oversight, while agents handle translation, execution, and feedback at machine speed.

That division should remain intact. Supply chains are full of exceptions: supplier politics, regulatory constraints, promotion strategy, brand commitments, and executive preferences that may not be fully visible in transaction data. A planning agent can surface options and shorten cycles. It should not become the silent owner of commercial trade-offs.

The strategic lesson is not automation; it is orchestration

The paper’s contribution is easy to understate because each component sounds familiar: RAG, fine-tuning, text-to-SQL, function calling, generated Python, task planning, feedback loops. None of these is exotic on its own.

The value is in the assembly.

JD.com’s framework treats supply chain planning as a living workflow rather than a static report. It uses the LLM where language and ambiguity are the bottleneck. It uses deterministic tools where mathematical reliability is needed. It uses SOP retrieval to keep the system aligned with organisational practice. It uses iterative replanning to prevent early assumptions from fossilising into bad execution.

That is the business lesson. The future of GenAI in operations is not a chatbot that “knows supply chain.” It is a control layer that turns intent into structured work, watches what happens, and updates the work before the business discovers the error through customer complaints.

Planning was never just a spreadsheet. It was always a negotiation between ambition and constraint. The useful agent does not end that negotiation. It makes the negotiation faster, more explicit, and less dependent on whoever last touched the workbook.

Tiny miracle, really. The aisles still need running. The difference is that the plan can finally keep up.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jiaheng Yin, Yongzhi Qi, Jianshen Zhang, Dongyang Geng, Zhengyu Chen, Hao Hu, Wei Qi, and Zuo-Jun Max Shen, “Rethinking Supply Chain Planning: A Generative Paradigm,” arXiv:2509.03811, 2025, https://arxiv.org/abs/2509.03811.
