A store manager does not usually make assortment and pricing decisions inside a clean optimization textbook.
More often, the decision lives in a less glamorous place: a sales spreadsheet, a distributor agreement, an approval memo, last month’s exception report, a half-remembered rule about which customer can handle which category, and one person in the room saying, “This SKU always works in that region.” Retail intelligence, in other words, often begins as a pile of semi-structured clues wearing a business-casual disguise.
That is the real problem behind DeepRule, a new framework for automated business rule generation in retail assortment and pricing optimization [1]. The paper is easy to misread as another “LLMs will optimize prices” story. It is not. Thankfully. We have enough magic pricing decks already.
The stronger reading is narrower and more useful: DeepRule treats LLMs as one component inside a closed-loop decision system. The LLM parses messy business text, proposes interpretable rule structures, reflects on failed decisions, and helps search symbolic rules. But demand prediction, constraint handling, cost calculation, parameter optimization, deployment checks, memory, and human validation still matter. In fact, the whole paper is interesting precisely because it does not hand the pricing problem to an LLM and hope the transformer has attended to gross margin.
DeepRule’s contribution is a mechanism: turn messy business knowledge into structured features, use predictive modeling to estimate sales response, search for auditable assortment-pricing rules, optimize those rules under business constraints, evaluate the result, remember the errors, and iterate.
That mechanism is the article. The model name is just the sticker on the machinery.
DeepRule starts where normal pricing models get uncomfortable
Classical assortment and pricing models usually want the world to be conveniently formatted. They expect transaction histories, customer-product choice pairs, structured features, reasonably stable demand assumptions, and constraints that can be written down before lunch.
Traditional retail does not always cooperate.
The paper frames four practical gaps. First, critical customer and distributor information often appears as unstructured text: negotiation records, sales assessments, price applications, approval documents, reward and penalty records. Second, demand features are entangled: product attributes, region, discounting, customer profile, timing, and price elasticity do not move politely in separate columns. Third, business constraints are layered: total purchase values, category coverage, SKU limits, policy-locked items, distributor incentives, and cost allocations. Fourth, black-box recommendations are difficult to operationalize because people still need to inspect, approve, and explain the rule.
DeepRule’s answer is not a single algorithm. It is a stacked workflow:
| Stage | What it does | Why it matters operationally |
|---|---|---|
| Knowledge fusion | Uses LLMs to parse unstructured customer, distributor, approval, and product information into structured priors and feature vectors | Converts “business text” into usable decision inputs |
| Demand prediction | Uses DNN-based modeling and rule-guided augmentation to forecast shipment or sales volume | Gives the optimizer a demand estimate instead of a slogan |
| Rule search | Uses symbolic regression and LLM-guided structure generation to create interpretable assortment-pricing functions | Produces auditable rules rather than opaque recommendations |
| Parameter optimization | Fixes the generated rule structure and tunes continuous parameters with optimization methods such as SQP/BFGS | Lets solvers do what solvers are good at |
| Constraint-aware allocation | Applies business rules such as purchase bounds, category limits, and SKU selection constraints | Keeps outputs feasible in the messy world where invoices happen |
| Reflection and memory | Stores prior rule modifications, loss changes, and segment-level errors for future iterations | Prevents the system from rediscovering the same mistake with great confidence |
This is the part many readers may miss: the LLM is not replacing the optimization layer. It is helping build the search space in which optimization becomes more business-readable.
That distinction matters. “LLM as pricing oracle” is fragile. “LLM as parser, rule-structure generator, and reflection engine inside a constrained decision pipeline” is much less glamorous and much more plausible.
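The loop can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `propose` is a hypothetical stand-in for the LLM, `evaluate` is a hypothetical stand-in for "tune the parameters, then measure the loss", and the structure labels and loss values are invented. The only point it makes is that a memory of past attempts stops the loop from paying twice for the same idea.

```python
from typing import Callable, Dict, List, Optional, Tuple

Attempt = Tuple[str, float]  # (rule structure label, loss after tuning)


def closed_loop(
    propose: Callable[[List[Attempt]], str],
    evaluate: Callable[[str], float],
    iterations: int,
) -> Tuple[Optional[str], float, List[Attempt]]:
    """Propose a structure, score it, remember the outcome, repeat."""
    memory: List[Attempt] = []
    best: Optional[str] = None
    best_loss = float("inf")
    for _ in range(iterations):
        structure = propose(memory)
        if any(structure == seen for seen, _ in memory):
            continue  # already scored: skip instead of re-evaluating
        loss = evaluate(structure)
        memory.append((structure, loss))
        if loss < best_loss:
            best, best_loss = structure, loss
    return best, best_loss, memory


# Toy stand-ins (hypothetical): four candidate skeletons with made-up losses.
TOY_LOSS: Dict[str, float] = {
    "linear": 0.9,
    "linear+cross": 0.6,
    "sigmoid_threshold": 0.4,
    "region_indicator": 0.7,
}
evaluated: List[str] = []


def toy_propose(memory: List[Attempt]) -> str:
    """Return the first structure not yet in memory, else repeat an old one."""
    tried = {structure for structure, _ in memory}
    untried = [name for name in TOY_LOSS if name not in tried]
    return untried[0] if untried else "linear"


def toy_evaluate(structure: str) -> float:
    evaluated.append(structure)
    return TOY_LOSS[structure]


best, best_loss, memory = closed_loop(toy_propose, toy_evaluate, iterations=6)
print(best, best_loss, len(evaluated))
```

Six iterations are requested but only four evaluations happen, because the last two proposals are already in memory. In a real system the expensive call is the evaluation, so that is the one worth guarding.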
The first move: turn business text into decision features
DeepRule begins with a customer and SKU representation problem. The framework takes inputs such as customer public information, historical records, price applications, approval documents, geolocation coordinates, demographic profiles, and SKU attributes. A locally deployed LLM extracts priors from these materials: store affiliation assumptions, reward functions, decision basis sets, brand preferences, business scope, and approval signals.
The paper then combines those extracted priors with more conventional feature engineering. Store-customer affiliation is estimated using distance, business scale, and LLM-derived priors. Demographic profiles are aggregated over stores within a customer’s operational radius. Temporal behavior is added through window statistics, Fourier components, and sequential features. SKU attributes are represented through embeddings rather than brittle one-hot encodings, especially for style-like attributes such as color, pattern, or fragrance.
This is not the most elegant part of the paper mathematically, but it is one of the most business-relevant. In many companies, the hardest part of pricing automation is not choosing the optimizer. It is converting human-operational knowledge into model-usable structure without destroying its context.
DeepRule’s pipeline says: do not pretend the messy text does not exist. Parse it, turn it into priors, merge it with structured data, and keep it inside the loop.
That is a useful lesson even if a firm never implements DeepRule exactly.
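For the temporal part of that feature set, a rough sketch helps make "window statistics and Fourier components" concrete. Everything specific here is an assumption for illustration: the monthly granularity, the three-month window, the 12-month period, and the feature names. The paper does not publish its feature code in this form.

```python
import math
from statistics import mean
from typing import Dict, List


def temporal_features(
    monthly_volume: List[float], window: int = 3, period: int = 12
) -> List[Dict[str, float]]:
    """Build rolling-window and first-harmonic Fourier features per month."""
    rows: List[Dict[str, float]] = []
    for t in range(len(monthly_volume)):
        # Use only months strictly before t, so the feature never sees the target.
        past = monthly_volume[max(0, t - window):t]
        angle = 2 * math.pi * (t % period) / period
        rows.append({
            "rolling_mean": float(mean(past)) if past else 0.0,
            "rolling_max": float(max(past)) if past else 0.0,
            "sin_1": math.sin(angle),  # seasonal position, first harmonic
            "cos_1": math.cos(angle),
        })
    return rows


features = temporal_features([10, 20, 30, 40])
print(features[3])
```

The window deliberately excludes the current month. Leaking the target into its own feature is the fastest way to build a forecast that looks excellent and predicts nothing.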
The second move: predict sales without confusing sales with price arithmetic
After feature construction, the framework uses a DNN to predict shipment or sales volume. The paper emphasizes a “feature-decoupled” sales volume model: instead of letting revenue dominate the learning process simply because revenue equals volume times price, the model tries to learn shipment dynamics directly and then feed those predictions into downstream optimization.
This is a subtle but important design choice. In retail pricing, a model can look smart by learning price-scaled revenue artifacts rather than demand behavior. That is like mistaking a thermometer for the weather: technically related, practically dangerous.
DeepRule also adds two feedback mechanisms around the prediction model.
First, it uses rule-prior-guided data augmentation. Business rules and LLM-generated pseudo-labels help expand training information where manual annotation is expensive. The paper gives a theoretical error-control discussion around pseudo-label bias, though this should be read as support for the design logic rather than as proof that every generated label is safe.
Second, it adds posterior cleaning and validation. Low-confidence samples are sent to an LLM-based cleaning process; invalid samples can be corrected or removed; ambiguous samples go to manual review; verified outcomes update the rule base. That last detail is not decorative. It is what stops the system from becoming an automated error recycling plant, which is a surprisingly common architecture in poorly governed AI systems.
The key operating principle is simple: prediction is not a one-shot model output. It is a maintained asset.
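The cleaning-and-validation routing described above can be written as a small triage function. This is a sketch under assumptions: the 0.8 confidence threshold, the verdict labels, and the `Sample` fields are all hypothetical, and the LLM cleaning step is represented only by the verdict it would have returned.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    confidence: float      # model confidence in the label, between 0 and 1
    llm_verdict: str = ""  # hypothetical output of the LLM cleaning step


def route(sample: Sample, confident_at: float = 0.8) -> str:
    """Decide what happens to a training sample after posterior checks."""
    if sample.confidence >= confident_at:
        return "keep"            # confident samples skip the cleaning step
    if sample.llm_verdict == "valid":
        return "keep"
    if sample.llm_verdict == "corrected":
        return "keep_corrected"  # invalid but repairable
    if sample.llm_verdict == "invalid":
        return "remove"
    return "manual_review"       # anything ambiguous goes to a person


for s in [
    Sample("a", 0.95),
    Sample("b", 0.40, "corrected"),
    Sample("c", 0.40, "invalid"),
    Sample("d", 0.40, "unsure"),
]:
    print(s.sample_id, route(s))
```

The default branch is the important one. Anything the cleaning step cannot classify falls through to manual review rather than silently staying in the training set.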
The third move: generate rules people can audit
The most interesting technical layer is the assortment-pricing rule search.
DeepRule formalizes a strategy as a function that takes customer, SKU, market, inventory, and cost features, then outputs two things: whether to recommend stocking a given SKU and what price to recommend. This is where symbolic regression enters. Instead of returning an opaque score, symbolic regression searches for mathematical or logical expressions that can be inspected.
The paper compares several ways to search this rule space:
| Method | Role in the paper | Practical interpretation |
|---|---|---|
| Evolutionary algorithm | Comparison method for symbolic rule evolution | Searches expression trees through mutation, pruning, and selection |
| Reinforcement learning | Comparison method for sequential expression generation | Treats partial rule construction as a decision process |
| GPT + Monte Carlo sampling | Hybrid comparison using LLM guidance with sampling/search | Shows whether generic LLM-guided exploration helps |
| Direct LLM reasoning | LLM-generated rule construction | Tests whether LLMs can directly propose useful decision logic |
| LLM structure + optimizer | Main DeepRule-style mechanism | Lets the LLM propose rule structure, then lets optimizers tune parameters |
| Fine-tuned LLM structure + optimizer | Variant/ablation-like comparison | Tests whether fine-tuning adds much over untuned structure generation |
The mechanism-first reading is essential here. The paper’s strongest claim is not “LLMs are better at math than RL.” The paper is more careful than that. It says LLMs are good at generating logical structure: conditionals, feature interactions, segmentation logic, and semantically plausible rule skeletons. They are weaker at hyperparameter tuning and may oscillate locally. So DeepRule separates the jobs.
The LLM proposes a constrained functional structure, such as combinations of linear terms, cross terms, sigmoid or threshold functions, and feature-region indicators. Then an optimizer solves the continuous parameter problem. If the tuned rule performs poorly, the LLM reconstructs the framework using the previous structure, evaluation feedback, and gradients or loss information.
That division of labor is the paper’s best engineering idea. Let language models organize business logic. Let numerical optimizers tune numbers. Radical, I know: the screwdriver does screwdriver work.
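A toy version of that split looks like this. The rule skeleton (a sigmoid discount ramp over customer scale) is a hypothetical example of the kind of structure an LLM might propose, and the data are synthetic. The paper names SQP/BFGS for the tuning step; to keep the sketch free of third-party dependencies, a plain compass (coordinate) search is swapped in here, which is a different and much cruder optimizer.

```python
import math
from typing import Callable, List, Sequence, Tuple


def sigmoid(z: float) -> float:
    z = max(-60.0, min(60.0, z))  # clamp so math.exp stays in range
    return 1.0 / (1.0 + math.exp(-z))


def discount_rule(scale: float, depth: float, midpoint: float) -> float:
    """Fixed skeleton: the discount ramps up once customer scale passes a midpoint."""
    return depth * sigmoid(10.0 * (scale - midpoint))


def compass_search(
    loss: Callable[[Sequence[float]], float],
    start: Sequence[float],
    step: float = 0.1,
    min_step: float = 1e-4,
    max_rounds: int = 10_000,
) -> Tuple[List[float], float]:
    """Derivative-free coordinate search; a simple stand-in for SQP/BFGS."""
    point = list(start)
    best = loss(point)
    for _ in range(max_rounds):
        if step <= min_step:
            break
        improved = False
        for i in range(len(point)):
            for delta in (step, -step):
                trial = list(point)
                trial[i] += delta
                value = loss(trial)
                if value < best:
                    point, best, improved = trial, value, True
        if not improved:
            step /= 2.0  # nothing helped at this scale, so look closer
    return point, best


# Synthetic targets generated from known parameters (depth=0.10, midpoint=0.50).
scales = [i / 10 for i in range(11)]
targets = [discount_rule(s, 0.10, 0.50) for s in scales]


def squared_error(params: Sequence[float]) -> float:
    depth, midpoint = params
    return sum(
        (discount_rule(s, depth, midpoint) - y) ** 2 for s, y in zip(scales, targets)
    )


start = [0.05, 0.30]
initial_loss = squared_error(start)
tuned, tuned_loss = compass_search(squared_error, start)
print(tuned, tuned_loss, initial_loss)
```

The structure never changes inside `compass_search`; only the two numbers do. If the tuned loss is still poor, that is the signal to go back and ask for a different skeleton, not to keep turning the same two knobs.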
The example rule is basically a retail decision factory
The paper gives an example hierarchical optimization rule, and it is worth treating as an implementation detail rather than a second thesis.
The rule pipeline has four stages:
- Data preprocessing: filter candidate customers, merge sales, material, customer, and cost data, compute unit prices, match costs using string similarity, and form a usable dataset.
- Sales forecasting: create customer-material pairs, simulate discount scenarios such as 0.9, 0.95, and 1.0, predict sales under normalized features, and use baseline predictions for downstream allocation.
- Cost and fee-ratio calculation: compute gross revenue, category proportions, and expense-to-revenue fee ratios by hierarchy.
- Constrained optimization: allocate recommended sales amounts under global purchase bounds, category-level bounds, policy-locked SKU constraints, and high-margin material selection rules.
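Two of those steps are easy to make concrete: matching costs by string similarity and computing an expense-to-revenue fee ratio. The sketch below uses Python's standard `difflib` for the matching. The material names, the 0.8 similarity cutoff, and the cost figures are invented, and the paper does not say which similarity measure it uses.

```python
import difflib
from typing import Dict, Optional


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def match_cost(
    material: str, cost_table: Dict[str, float], cutoff: float = 0.8
) -> Optional[float]:
    """Return the cost of the closest-named material, or None if nothing is close."""
    by_normalized = {normalize(key): key for key in cost_table}
    hits = difflib.get_close_matches(
        normalize(material), list(by_normalized), n=1, cutoff=cutoff
    )
    return cost_table[by_normalized[hits[0]]] if hits else None


def fee_ratio(expense: float, gross_revenue: float) -> float:
    """Expense-to-revenue ratio; zero revenue yields 0.0 rather than an error."""
    return expense / gross_revenue if gross_revenue else 0.0


# Hypothetical cost table keyed by material name.
costs = {"A4 copy paper 80 g": 2.10, "Kraft paper roll": 5.40}
print(match_cost("A4 Copy Paper 80g", costs))
print(match_cost("zzz-unknown-item", costs))
print(fee_ratio(expense=12.0, gross_revenue=100.0))
```

Returning `None` for a weak match is deliberate. A wrong cost attached with confidence is worse than a missing cost flagged for review.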
One specific constraint is especially business-readable: total recommended purchase amounts are bounded around historical averages. In the full experimental setting, the paper uses a 25% tolerance in one setup; in the example rule, it shows a tighter 0.95 to 1.05 bound around historical average order value. That is exactly the kind of constraint real teams care about. A recommendation that maximizes profit by asking a distributor to buy three times its usual monthly volume is not an optimization result. It is a hallucination with a purchase order attached.
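As a rough illustration of the 0.95 to 1.05 bound, the function below rescales a recommendation proportionally until its total lands inside the band. Proportional rescaling is an assumption made for brevity; the paper enforces the bound inside a constrained optimization rather than as a post-hoc adjustment, and the SKU names and amounts are invented.

```python
from typing import Dict


def clamp_total(
    recommended: Dict[str, float],
    historical_avg: float,
    lower: float = 0.95,
    upper: float = 1.05,
) -> Dict[str, float]:
    """Scale amounts so the total sits within [lower, upper] x historical average."""
    total = sum(recommended.values())
    low, high = lower * historical_avg, upper * historical_avg
    if total == 0 or low <= total <= high:
        # Already feasible, or nothing to scale (an empty plan cannot be stretched).
        return dict(recommended)
    target = high if total > high else low
    factor = target / total
    return {sku: amount * factor for sku, amount in recommended.items()}


plan = clamp_total({"sku_a": 200.0, "sku_b": 100.0}, historical_avg=100.0)
print(plan, sum(plan.values()))
```

A plan worth three times the usual order is pulled back to 1.05 times, with the mix between SKUs preserved.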
The example also selects high-margin materials by comparing margin rates against category fee ratios plus a buffer, caps the number of SKUs per category, and fills gaps when candidate sales do not meet lower category bounds. This is where DeepRule becomes less like “AI pricing” and more like an executable version of a senior planner’s checklist.
That is a compliment.
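The high-margin selection step reads almost directly as code. In this sketch the 0.02 buffer, the cap of two SKUs per category, and all SKU data are hypothetical; only the shape of the rule (margin rate must exceed category fee ratio plus a buffer, with a per-category cap) comes from the paper's example. The gap-filling step for unmet lower bounds is left out.

```python
from typing import Dict, List


def select_high_margin(
    candidates: List[Dict],
    category_fee_ratio: Dict[str, float],
    buffer: float = 0.02,
    max_per_category: int = 2,
) -> List[str]:
    """Pick SKUs whose margin clears fee ratio + buffer, best first, capped per category."""
    chosen: List[str] = []
    taken: Dict[str, int] = {}
    for item in sorted(candidates, key=lambda c: c["margin_rate"], reverse=True):
        category = item["category"]
        if item["margin_rate"] <= category_fee_ratio[category] + buffer:
            continue  # margin does not cover the category's fee load
        if taken.get(category, 0) >= max_per_category:
            continue  # category is already at its SKU cap
        chosen.append(item["sku"])
        taken[category] = taken.get(category, 0) + 1
    return chosen


candidates = [
    {"sku": "T1", "category": "tissue", "margin_rate": 0.40},
    {"sku": "T2", "category": "tissue", "margin_rate": 0.35},
    {"sku": "T3", "category": "tissue", "margin_rate": 0.33},
    {"sku": "T4", "category": "tissue", "margin_rate": 0.20},
    {"sku": "K1", "category": "kraft", "margin_rate": 0.30},
    {"sku": "K2", "category": "kraft", "margin_rate": 0.26},
]
selected = select_high_margin(candidates, {"tissue": 0.25, "kraft": 0.25})
print(selected)
```

T3 is profitable enough but loses to the cap; K2 fits under the cap but does not clear the margin threshold. Both exclusions can be explained to a planner in one sentence, which is the point of keeping the rule symbolic.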
The experiments mainly test whether the loop beats isolated methods
The paper reports several layers of evidence. They should not be treated as equal.
| Evidence layer | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Rule-generation comparison | Main evidence for search-method choice | LLM-guided structural search improves search efficiency versus evolutionary, RL, GPT+Monte Carlo, and direct LLM approaches | It does not prove LLMs are universally better optimizers |
| Full-pipeline comparison | Main end-to-end business evidence | DeepRule-style LLM reasoning + optimizer improves profit-sales trade-offs against several baselines | It does not prove cross-industry generalization |
| LLM base-model comparison | Ablation / robustness test | Performance is not entirely dependent on one specific LLM | It does not prove model choice is irrelevant in all deployments |
| Prompt, memory, and compiler observations | Implementation detail / exploratory extension | Constrained prompts, semantic context, and memory likely help rule evolution | It does not quantify a general theory of prompt design |
| Golden-response policy distance appendix | Additional evaluation observation | DeepRule-style fitted rule generation reduces action-distance faster in the reported setting | It does not replace live long-term business validation |
The first rule-generation experiment compares six methods after 50 iterations, using historical shipment records, SKU categories, customer features, and constraints such as purchase amount limits, maximum SKUs per primary category, and minimum category coverage. The reported metrics include total sales volume, total profit, constraint-violating customer count, and the first epoch needed to reach sales and profit thresholds.
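Two of those metrics are simple enough to pin down in code: the count of constraint-violating customers and the first epoch that reaches both thresholds. The limits, orders, and per-epoch history below are invented for illustration, and the paper's exact constraint definitions may differ.

```python
from typing import Dict, List, Optional, Tuple

Order = List[Tuple[str, str, float]]  # (sku, primary category, amount)


def violates(
    order: Order, max_amount: float, max_skus_per_category: int, min_categories: int
) -> bool:
    """True if the order breaks the amount limit, the SKU cap, or category coverage."""
    total = sum(amount for _, _, amount in order)
    per_category: Dict[str, int] = {}
    for _, category, _ in order:
        per_category[category] = per_category.get(category, 0) + 1
    return (
        total > max_amount
        or any(n > max_skus_per_category for n in per_category.values())
        or len(per_category) < min_categories
    )


def first_epoch_reaching(
    history: List[Tuple[float, float]], sales_threshold: float, profit_threshold: float
) -> Optional[int]:
    """1-based index of the first epoch meeting both thresholds, or None."""
    for epoch, (sales, profit) in enumerate(history, start=1):
        if sales >= sales_threshold and profit >= profit_threshold:
            return epoch
    return None


orders: Dict[str, Order] = {
    "c1": [("s1", "tissue", 50.0), ("s2", "kraft", 40.0)],
    "c2": [("s1", "tissue", 80.0), ("s3", "tissue", 70.0)],
    "c3": [("s1", "tissue", 10.0), ("s3", "tissue", 10.0),
           ("s4", "tissue", 10.0), ("s2", "kraft", 10.0)],
}
violating = sum(violates(o, 120.0, 2, 2) for o in orders.values())
history = [(5.0, 4.0), (6.1, 4.9), (6.3, 5.2), (6.6, 5.4)]
print(violating, first_epoch_reaching(history, 6.0, 5.0))
```

Counting violating customers rather than violated constraints matters: one customer breaking three limits is still one unusable recommendation.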
Figure 3 is the important result visually. It shows that the LLM structure + optimizer variants reach stronger profit and sales combinations with far fewer constraint violations than the evolutionary and RL baselines. The fine-tuned version is best in this comparison, but the untuned LLM structure + optimizer is close enough that the authors argue fine-tuning brings only marginal gains.
The paper also makes a useful admission: LLMs are good at rule structure, but weaker at hyperparameter optimization and prone to local oscillation. That is not a weakness of the paper; it is one of the more credible parts of the research. Systems that admit where the LLM is bad tend to be more useful than systems that quietly hope nobody asks.
The full-pipeline numbers show a trade-off, not a miracle
The main end-to-end comparison uses historical data from a paper manufacturing company’s online feedback system. The task recommends assortments from 215 SKUs to 2,362 customers, combining shipment sequences, store demographic/geographic profiles, and business tags. DeepRule is evaluated as an untuned LLM + optimizer at 50, 100, and 300 iterations, against low-rank bandit, context-based clustering, model-free assortment pricing, and systematic B2B recommendation baselines.
The reported table is worth reading carefully:
| Method | Test total sales volume ($\times 10^6$) | Test total profit ($\times 10^5$) |
|---|---|---|
| Low-rank bandit | 7.450 | 4.174 |
| Context-based clustering | 5.128 | 5.258 |
| Model-free assortment pricing | 7.160 | 4.882 |
| Systematic B2B recommendation | 5.978 | 5.420 |
| LLM reasoning + optimizer, N=50 | 6.223 | 5.181 |
| LLM reasoning + optimizer, N=100 | 6.609 | 5.366 |
| LLM reasoning + optimizer, N=300 | 6.849 | 5.643 |
The obvious but wrong reading is: DeepRule wins because it has the highest profit at N=300. The more useful reading is: each baseline has a personality.
Low-rank bandit and model-free assortment pricing push higher sales volume, but with lower profit. Context-based clustering produces higher profit than those sales-heavy methods, but at a much lower sales volume. The systematic B2B recommendation baseline is more balanced. DeepRule gradually moves toward a better frontier: at 50 iterations it already beats clustering on sales (6.223 vs. 5.128) but not on profit (5.181 vs. 5.258); at 100 iterations it passes clustering on both and comes within about 1% of B2B profit (5.366 vs. 5.420) with higher sales; at 300 iterations it posts the highest reported profit (5.643), with sales volume still below low-rank bandit and model-free assortment pricing but with clearly higher profit than either.
So the practical message is not “DeepRule sells the most.” It does not. The practical message is that DeepRule appears to search toward a better profit-feasible trade-off under business constraints.
For managers, that distinction matters. A pricing system that increases volume while compressing margin may look wonderful in a dashboard and miserable in the finance office. DeepRule’s evidence is more relevant to businesses that care about constrained profitability, not just throughput.
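The trade-off reading can be checked directly against the table. The sketch below copies the reported numbers and asks which methods are non-dominated, meaning no other method is at least as good on both sales and profit. The function names are illustrative, and this is a reading aid rather than an analysis the paper itself performs.

```python
from typing import Dict, List, Tuple

# (test sales volume in 10^6, test profit in 10^5), copied from the table above.
RESULTS: Dict[str, Tuple[float, float]] = {
    "Low-rank bandit": (7.450, 4.174),
    "Context-based clustering": (5.128, 5.258),
    "Model-free assortment pricing": (7.160, 4.882),
    "Systematic B2B recommendation": (5.978, 5.420),
    "LLM reasoning + optimizer, N=50": (6.223, 5.181),
    "LLM reasoning + optimizer, N=100": (6.609, 5.366),
    "LLM reasoning + optimizer, N=300": (6.849, 5.643),
}


def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """a dominates b if it is at least as good on both axes and not identical."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b


def pareto_front(results: Dict[str, Tuple[float, float]]) -> List[str]:
    """Names of methods that no other method dominates."""
    return [
        name
        for name, point in results.items()
        if not any(dominates(other, point) for other in results.values())
    ]


front = pareto_front(RESULTS)
print(front)
```

On these numbers the non-dominated set is low-rank bandit, model-free assortment pricing, and DeepRule at N=300. The N=300 run dominates clustering, the B2B baseline, and its own earlier iterations, but it does not dominate the two sales-heavy methods. That is exactly the "trade-off, not a miracle" picture.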
The model ablation says the architecture matters more than the brand badge
The paper compares six LLM base models at 50 iterations: GPT-4o, DeepSeek-R1, Gemini, Qwen3-32B, Claude-3.5, and Llama4-Scout. The results are:
| Model | N=50 test total sales volume ($\times 10^6$) | N=50 test total profit ($\times 10^5$) |
|---|---|---|
| GPT-4o | 6.345 | 5.426 |
| DeepSeek-R1 | 6.220 | 5.230 |
| Gemini | 6.171 | 5.095 |
| Qwen3-32B | 5.960 | 4.991 |
| Claude-3.5 | 6.186 | 5.311 |
| Llama4-Scout | 5.992 | 4.687 |
The authors interpret the differences as mostly below 5%, except for a larger profit gap in Llama4-Scout. Their conclusion is that the gains come from general reasoning and semantic structuring capabilities common to current LLMs, not from one model’s special mathematical talent.
That interpretation is plausible, but should be applied with care. The result supports architectural robustness within this experiment. It does not mean procurement teams can swap models casually in production without retesting. Model behavior around structured code generation, field naming, constraint following, and long-context memory can vary in ways that matter operationally.
Still, the ablation is useful. It shifts attention away from “Which LLM is smartest?” toward “Did we design the decision loop correctly?” That is usually the better question, because model rankings age faster than yogurt.
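The spread between models is easy to recompute from the ablation table. One caveat up front: the reference point behind the authors' 5% figure is not stated here, so this sketch assumes deviation from the cross-model mean. Other baselines, such as the gap to the best model, would give different percentages.

```python
from statistics import mean
from typing import Dict, Tuple

# (sales volume in 10^6, profit in 10^5) at N=50, copied from the table above.
ABLATION: Dict[str, Tuple[float, float]] = {
    "GPT-4o": (6.345, 5.426),
    "DeepSeek-R1": (6.220, 5.230),
    "Gemini": (6.171, 5.095),
    "Qwen3-32B": (5.960, 4.991),
    "Claude-3.5": (6.186, 5.311),
    "Llama4-Scout": (5.992, 4.687),
}


def deviations_from_mean(
    results: Dict[str, Tuple[float, float]]
) -> Dict[Tuple[str, str], float]:
    """Relative deviation of each model's sales and profit from the cross-model mean."""
    avg_sales = mean([sales for sales, _ in results.values()])
    avg_profit = mean([profit for _, profit in results.values()])
    out: Dict[Tuple[str, str], float] = {}
    for model, (sales, profit) in results.items():
        out[(model, "sales")] = (sales - avg_sales) / avg_sales
        out[(model, "profit")] = (profit - avg_profit) / avg_profit
    return out


devs = deviations_from_mean(ABLATION)
largest = max(devs, key=lambda key: abs(devs[key]))
under_5 = sum(abs(value) < 0.05 for value in devs.values())
print(largest, round(devs[largest], 4), under_5, len(devs))
```

Under that assumption, 10 of the 12 deviations fall below 5%. The largest is Llama4-Scout's profit, roughly 8.5% below the mean, and GPT-4o's profit sits a little under 6% above it. The models cluster tightly on sales and spread more on profit, which is another reason to retest before swapping one for another.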
What businesses can actually take from DeepRule
DeepRule is most relevant to businesses with three characteristics.
First, the company has operational knowledge trapped in messy artifacts: contracts, approval text, sales notes, distributor feedback, and exception rules. If all useful information is already cleanly structured, DeepRule’s LLM extraction layer is less distinctive. But in traditional retail, manufacturing distribution, and B2B channel sales, messy knowledge is often the default database.
Second, the decision must remain auditable. Pricing and assortment recommendations affect margins, channel conflict, inventory, sales incentives, and customer relationships. A black-box recommendation that cannot be explained may be technically impressive and politically unusable. Symbolic rules and decision trees give managers something to inspect, challenge, and revise.
Third, the optimization problem is constrained by reality. Minimum category coverage, SKU caps, purchase amount bounds, policy-locked products, fee ratios, and historical order ranges are not annoying edge cases. They are the job.
Here is the business pathway, separated cleanly:
| Layer | What the paper directly shows | Cognaptus interpretation | Remaining uncertainty |
|---|---|---|---|
| Data conversion | LLMs can parse unstructured business materials into priors and features inside the framework | LLMs are valuable as “business data translators,” not only chat interfaces | Extraction quality and validation cost depend on local document quality |
| Rule search | LLM-guided structure + optimizer performs well against several search baselines | LLMs can reduce the cost of exploring interpretable business-rule spaces | Search behavior may vary by domain, feature semantics, and constraint complexity |
| End-to-end performance | The framework improves reported profit-sales trade-offs in a paper-industry retail setting | The ROI case is strongest where current rules are manual, stale, or hard to personalize | Cross-domain transfer is not yet proven |
| Model choice | Several LLMs produce broadly similar results in the ablation | Architecture may matter more than brand-name model selection | Production model substitution still needs local testing |
| Interpretability | Symbolic rules and constrained functions are central to the design | Adoption improves when decision logic can be audited | Interpretability can still become complex if generated rules grow too large |
The most practical lesson is not “buy DeepRule.” The practical lesson is to redesign AI decision systems around closed loops: extraction, prediction, rule generation, optimization, evaluation, memory, and governance.
That is where the business value lives.
Where the evidence is strong, and where it is still thin
The paper is strongest as an architectural case study for traditional retail optimization under incomplete digitalization. It shows a thoughtful integration of LLM extraction, predictive modeling, symbolic regression, constrained optimization, and iterative reflection. It also reports concrete end-to-end performance numbers against relevant baselines.
But several boundaries matter.
First, the main validation is domain-specific. The authors report evidence from a physical paper-industry setting, using a paper manufacturing company’s online feedback system. That is valuable, because real operational data is more interesting than toy benchmarks. But it is not the same as proving the framework transfers cleanly to fashion, grocery, electronics, pharmaceuticals, or auction mechanisms.
Second, some experimental reporting is compressed. The paper gives meaningful tables, constraints, baseline descriptions, and appendix configurations, but it does not fully expose every implementation detail needed to audit baseline fairness, variance, deployment conditions, and long-term causal business impact. The reported experiments repeat runs and use means, but the paper does not present the kind of statistical uncertainty analysis a cautious enterprise buyer would want before changing a pricing workflow.
Third, the framework depends on the quality of the knowledge loop. If LLM extraction introduces wrong priors, if pseudo-labels reinforce old bias, if manual review is under-resourced, or if business constraints are poorly encoded, the system may produce beautifully auditable nonsense. Auditable nonsense is still nonsense. It just comes with better formatting.
Fourth, the system’s improvement comes through iterations. The paper notes that real-world product selection and pricing have delayed cycles and can tolerate higher per-round computation time. That may be true in monthly B2B assortment planning; it may be less true in fast-moving retail settings where daily or intraday repricing matters.
Finally, the authors themselves point to open directions: cross-domain generalization, adding quantitative business insight into the refinement phase, and building domain-specific generative models for data-rich settings. Those are not footnotes. They are the roadmap of things not yet solved.
The deeper lesson: AI should not replace rules; it should make rule-making less primitive
Many businesses still run on rules. Discount rules. Approval rules. Product coverage rules. Customer priority rules. Regional exception rules. Rules that were once strategic and then became folklore.
The AI dream is often framed as replacing those rules with a black-box model. DeepRule suggests a better direction: use AI to discover, generate, test, optimize, and revise rules.
That is less theatrical than “autonomous pricing agent.” It is also more deployable.
The paper’s best idea is the separation of responsibilities. LLMs parse language-rich business context and propose human-readable structures. DNNs model nonlinear demand. Optimizers tune parameters. Constraint logic protects feasibility. Memory modules reduce repeated mistakes. Human validation handles ambiguous cases. Symbolic rules keep the result inspectable.
This is what serious AI automation increasingly looks like: not one giant model sitting on a throne, but a disciplined pipeline where each component is allowed to be useful and prevented from becoming imperial.
For Cognaptus readers, the takeaway is straightforward. If your business process still depends on scattered documents, tacit manager knowledge, and manually updated rules, the opportunity is not merely to “add an LLM.” The opportunity is to build a rule-generation loop that turns experience into structured, testable, auditable decision logic.
Rule of thumb does not disappear. It gets compiled.
[1] Yusen Wu and Xiaotie Deng, “DeepRule: An Integrated Framework for Automated Business Rule Generation via Deep Predictive Modeling and Hybrid Search Optimization,” arXiv:2512.03607, 2025. https://arxiv.org/abs/2512.03607