Warehouse.

That is a better place to start than “large language models for combinatorial optimization,” because the business problem is not philosophical. A warehouse has stacks, access directions, priorities, robots, blocked items, and deadlines. Someone has to decide which unit load moves first, which move creates future trouble, and how to search through the possible rearrangements without melting the compute budget.

The tempting AI story is familiar: ask a large model to “solve the optimization problem.” It sounds modern. It is also the kind of sentence that makes operations researchers quietly close the browser.

The more interesting story in Bömer et al.’s paper, “Algorithmic Prompt-Augmentation for Efficient LLM-Based Heuristic Design for A* Search,” is narrower and more useful.[1] The model is not asked to replace the search algorithm. It is asked to generate a heuristic function that helps A* search decide which states to explore first. The search algorithm still does the disciplined work. The LLM proposes small pieces of code. An evolutionary loop evaluates them. Bad ideas are killed without ceremony. Good ones survive.

That is the article’s central lesson: in constrained optimization, the practical unit of AI value may not be the answer. It may be the scoring function.

And the paper’s most useful twist is that the LLM performs better when it sees the algorithmic environment in which that scoring function will operate. Not just a description of the problem. Not just a motivational paragraph about being an expert. The actual A* context: the open list, node expansion, neighbor generation, goal test, objective calculation, and the place where the heuristic value enters $f(n) = g(n) + h(n)$.

Apparently, “please be smart” is not a systems architecture. Tragic, but educational.

Most early LLM-based heuristic-design work is easier to understand if we imagine the model helping with constructive heuristics. The model proposes rules that build a solution step by step. For problems such as the Traveling Salesperson Problem or online bin packing, a bad constructive decision may hurt quality, but the process can often still produce some feasible output.

The paper argues that this is not enough for a broader class of operational problems. Many real search spaces are constrained. Some moves are illegal. Some early decisions cut off useful future paths. A greedy rule can become confidently wrong, which is still wrong, only more efficiently packaged.

This is why the authors study A* guiding heuristics instead of ordinary constructive heuristics. In A*, each node has an accumulated cost $g(n)$ and a heuristic estimate $h(n)$. The algorithm uses their combination to prioritize the search:

$$ f(n) = g(n) + h(n) $$

The LLM-generated function does not directly output the whole plan. It scores states. That score changes the order in which A* expands the search tree. In business language, it does not replace the planner; it changes the planner’s attention.
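To make the role of the scoring function concrete, here is a minimal A* sketch with a pluggable heuristic. It is not the paper's implementation: the toy grid, the function names, and the `max_expansions` cap are assumptions chosen for illustration. The only point is that `heuristic` enters the search in exactly one place, the priority $f(n) = g(n) + h(n)$.

```python
import heapq
from itertools import count

def a_star(start, is_goal, get_neighbors, heuristic, max_expansions=100_000):
    """Return (cost, path, expansions), or None if the cap is hit or the open list empties."""
    tie = count()  # tie-breaker so the heap never has to compare states
    open_list = [(heuristic(start), next(tie), 0, start)]
    best_g = {start: 0}
    parent = {start: None}
    expansions = 0
    while open_list and expansions < max_expansions:
        _, _, g, state = heapq.heappop(open_list)
        if g > best_g[state]:
            continue  # stale entry: a cheaper route to this state was found later
        if is_goal(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return g, path[::-1], expansions
        expansions += 1
        for nxt, step_cost in get_neighbors(state):
            new_g = g + step_cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                parent[nxt] = state
                # f(n) = g(n) + h(n): the only place the heuristic touches the search
                heapq.heappush(open_list, (new_g + heuristic(nxt), next(tie), new_g, nxt))
    return None

# Hypothetical toy domain: 4x4 grid, unit-cost moves, start (0, 0), goal (3, 0).
GOAL = (3, 0)

def grid_neighbors(state):
    x, y = state
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < 4 and 0 <= ny < 4:
            yield (nx, ny), 1

def manhattan(state):
    return abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1])

def zero(state):
    return 0  # no guidance: A* degenerates to uniform-cost search

blind = a_star((0, 0), lambda s: s == GOAL, grid_neighbors, zero)
guided = a_star((0, 0), lambda s: s == GOAL, grid_neighbors, manhattan)
print("blind expansions:", blind[2], "guided expansions:", guided[2])
```

Both calls return the same cost of 3, but the guided search expands only the three states on the bottom row while the blind search expands more. Same planner, different attention.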

The authors test this idea on two deliberately different domains:

| Domain | Why it matters | What makes it useful for evaluation |
| --- | --- | --- |
| Unit-Load Pre-Marshalling Problem (UPMP) | A warehouse logistics problem where unit loads must be rearranged so high-priority items are not blocked | Practical, relatively new, and sparsely represented in likely LLM training data |
| Sliding Puzzle Problem (SPP) | A classic tile-rearrangement search problem | Well-studied benchmark with known heuristic traditions, but tested here at large $20 \times 20$ scale where pattern-database heuristics become impractical |

This pairing is important. UPMP asks whether the method can help in a niche operational domain with limited existing heuristic knowledge. SPP asks whether it can still help in a domain where decades of heuristic-search research already exist, but where the instance scale makes some standard methods less usable.

So the paper is not merely “LLMs discover heuristics.” It is closer to: LLMs can be placed inside a controlled heuristic-design loop for constrained search, and the way we expose the algorithm to the model changes the quality of what comes out.

The mechanism: give the model the search procedure, not just the task description

The authors build on Evolution of Heuristics (EoH), where an LLM generates candidate heuristic code and a corresponding natural-language “thought.” Those candidates are evaluated on problem instances. Better heuristics are retained. New generations are produced through prompt strategies that either explore new ideas or modify existing ones.
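The generate, evaluate, retain cycle described above can be sketched in a few lines. This is a toy, not EoH itself: `propose_candidate` is a hypothetical stand-in for the LLM call and merely perturbs a number, `fitness` is a made-up score where lower is better, and the population and offspring sizes are assumptions. A real loop would execute each generated heuristic inside the search on benchmark instances.

```python
import random

def propose_candidate(parent, rng):
    """Hypothetical stand-in for an LLM call. Returns (thought, weight)."""
    if parent is None:
        # Exploration: propose a new idea from scratch.
        weight = rng.uniform(0.0, 3.0)
        return (f"explore: try weight {weight:.2f}", weight)
    # Modification: adjust an existing candidate.
    weight = max(0.0, parent[1] + rng.gauss(0.0, 0.3))
    return (f"modify parent: try weight {weight:.2f}", weight)

def fitness(candidate):
    """Toy fitness, lower is better. Assumed target weight of 1.0 for illustration."""
    return abs(candidate[1] - 1.0)

def evolve(generations=20, population_size=4, offspring=8, seed=0):
    rng = random.Random(seed)
    population = [propose_candidate(None, rng) for _ in range(population_size)]
    for _ in range(generations):
        children = []
        for i in range(offspring):
            # Alternate between exploring new ideas and modifying survivors.
            parent = rng.choice(population) if i % 2 else None
            children.append(propose_candidate(parent, rng))
        # Retain only the best candidates; weak ideas are dropped.
        population = sorted(population + children, key=fitness)[:population_size]
    return population

final_population = evolve()
for thought, weight in final_population:
    print(f"{fitness((thought, weight)):.3f}  {thought}")
```

Because survivors are kept alongside their children, the best fitness in the population can never get worse from one generation to the next. That selection pressure, not the generator's eloquence, is what makes the loop a design process.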

The paper compares four prompt settings:

| Framework | What the LLM receives | Likely role in the design loop |
| --- | --- | --- |
| EoH | Basic task description, heuristic code/thought format, evolutionary procedure | Baseline automated heuristic evolution |
| P-CEoH | EoH plus problem-specific description | Adds domain semantics and constraints |
| A-CEoH | EoH plus algorithmic context from A* | Shows how the heuristic is used inside the search process |
| PA-CEoH | Both problem-specific and algorithmic context | Combines domain meaning with algorithm mechanics |

The new contribution is A-CEoH: Algorithmic-Contextual Evolution of Heuristics. Instead of only telling the LLM what problem it is solving, A-CEoH shows the model the A* driver and relevant methods such as is_goal(), get_neighbors(), reconstruct_path(), and get_objective_value().

That detail matters because a heuristic function is not an isolated poem. It is called inside a specific computational routine. It affects which node enters the priority queue. It interacts with visited-state logic, node caps, timeouts, and objective measurement. If the model cannot see those interfaces, it may generate plausible code that does not guide the actual search very well.

A-CEoH’s promise is therefore not “more context is always better.” That sentence is popular because it is easy and nearly useless. The sharper claim is that algorithmic context can teach the model how its generated heuristic will be consumed.

That is a different kind of prompt engineering. Less incense. More plumbing.
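As a rough illustration of the four prompt settings, the sketch below assembles a prompt from optional context blocks. The wording of every string is an assumption and not the paper's prompt text. The driver shown in `ALGORITHM_CONTEXT` is simplified pseudocode held in a string, and `pop_lowest_f` and `push` are hypothetical helper names; only `is_goal`, `get_neighbors`, `reconstruct_path`, and `get_objective_value` are method names mentioned in the paper.

```python
TASK = "Write a Python function heuristic(state) -> float that guides A* toward the goal."

# Assumed wording, for illustration only.
PROBLEM_CONTEXT = (
    "Problem: unit loads in a warehouse bay must be rearranged so that "
    "high-priority loads are not blocked by lower-priority ones."
)

# Simplified pseudocode shown to the model, never executed here.
ALGORITHM_CONTEXT = '''\
def a_star(start):
    open_list = [(heuristic(start), 0, start)]
    while open_list:
        f, g, state = pop_lowest_f(open_list)
        if is_goal(state):
            return reconstruct_path(state), get_objective_value(state)
        for nxt, cost in get_neighbors(state):
            push(open_list, (g + cost + heuristic(nxt), g + cost, nxt))
'''

# (include problem context?, include algorithmic context?)
SETTINGS = {
    "EoH": (False, False),
    "P-CEoH": (True, False),
    "A-CEoH": (False, True),
    "PA-CEoH": (True, True),
}

def build_prompt(setting):
    use_problem, use_algorithm = SETTINGS[setting]
    parts = [TASK]
    if use_problem:
        parts.append(PROBLEM_CONTEXT)
    if use_algorithm:
        parts.append(
            "Your function will be called inside this search driver:\n" + ALGORITHM_CONTEXT
        )
    return "\n\n".join(parts)

print(build_prompt("A-CEoH"))
```

The design choice worth noticing is how little is added: one driver and its interface, enough for the model to see where its return value ends up.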

The experiments are not one result; they are four different tests

The paper’s experimental section is best read as a sequence of tests with different purposes. Mixing them together would make the results sound either too magical or too vague.

| Test | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Context comparison across EoH, P-CEoH, A-CEoH, PA-CEoH | Main evidence | Algorithmic context improves generated heuristic quality over the baseline, especially for Qwen2.5-Coder:32b | That the same prompt design will dominate on every search algorithm or domain |
| Fitness over generations for Qwen2.5-Coder:32b on UPMP | Diagnostic / mechanism evidence | A-CEoH helps produce reasonable heuristics early; P-CEoH helps later improvement; PA-CEoH combines both tendencies | That convergence behavior is universal across models and problems |
| Comparison with human-designed heuristics | Comparison with prior work | Generated heuristics can be competitive with or better than selected handcrafted heuristics under the paper’s target settings | That generated heuristics are globally superior to all human-designed heuristics |
| Token usage | Implementation detail / cost signal | More context increases input tokens but can reduce output length, suggesting more focused generation | That the approach is automatically cheaper in production |

This distinction is useful because the headline result — LLM-generated heuristics outperforming handcrafted ones — is attractive, but it is not the whole story. The business lesson sits in the mechanism and evaluation discipline.

On UPMP, combined context gives the strongest result

For the Unit-Load Pre-Marshalling Problem, A-CEoH consistently outperforms the EoH baseline across the evaluated models in both best-found heuristic and median fitness. The combined PA-CEoH setting performs best overall, with multiple runs reaching the best reachable fitness value of 0.0815.

The interpretation is intuitive. In UPMP, algorithmic context helps the model understand how the heuristic affects A*. Problem context helps it understand what warehouse states and constraints mean. Combining both gives the model two useful maps: one of the search machine, one of the problem world.

The generation dynamics reinforce this. For Qwen2.5-Coder:32b on UPMP, A-CEoH produces reasonably good heuristics early but struggles to keep improving. P-CEoH starts more like the baseline but supports more reliable later improvement. PA-CEoH benefits from both patterns: better early direction and stronger later refinement.

That is a mechanism worth remembering. Algorithmic context may help the model enter the right region of the search-design space. Problem context may help it refine within that region. The exact phrasing is mine, not the authors’, but it matches the pattern their results describe.

The comparison with a prior optimal A* implementation is also informative, but it should be read carefully. On the UPMP training instances, the best Qwen2.5-Coder and GPT-4o PA-CEoH heuristics solve all 10 cases with the same reported fitness as the optimal A* baseline, while running faster on average. On the additional 30 test instances, both LLM-generated heuristics solve all 30 cases, while the optimal implementation solves 27 within the evaluation limit. However, among solved instances, the optimal implementation still reports a lower (that is, better) fitness than the Qwen-generated heuristic and a slightly lower fitness than the GPT-4o-generated heuristic.

So the correct business reading is not “LLMs beat optimal search.” That would be the kind of sentence that deserves immediate quarantine.

The correct reading is more precise: under these test settings, generated guiding heuristics can improve practical solve coverage and runtime behavior, while maintaining near-optimal quality on the evaluated instances. For operations systems, that is often exactly the tradeoff that matters. A theoretically elegant method that fails within the time limit is not a plan; it is a very principled timeout.

On large sliding puzzles, algorithmic context beats problem description for Qwen

The Sliding Puzzle Problem behaves differently. Here, the best heuristic comes from Qwen2.5-Coder:32b under A-CEoH, with a reported fitness value of 0.445. P-CEoH does not improve heuristic quality for Qwen in this setting. GPT-4o gains moderately from prompt augmentation but never improves beyond a fitness value of 0.6. Gemma2:27b mostly struggles, except for a few isolated P-CEoH runs.

This matters because SPP is not an obscure warehouse problem. It is a classic search benchmark. The model may already have latent exposure to puzzle heuristics such as Manhattan distance or linear conflict. Adding more problem description may not be the missing ingredient. The missing ingredient may be the execution context: how the heuristic plugs into the A* implementation used in the experiment.

The comparison with human-designed heuristics is especially interesting. On seeds 0–9, the generated A-CEoH Qwen heuristic solves 10 instances, while both the hybrid heuristic and the optimal A* implementation solve 5 under the reported constraints. The generated heuristic is much faster on average than the two baselines. But quality needs nuance: on the five instances solved by both the optimal method and the generated heuristic, the generated heuristic’s fitness is 0.281, compared with the optimal method’s 0.252. The gap is small in this subset, but it exists.

On seeds 10–19, the generated heuristic solves 7 instances with fitness 0.392 and average time 16.272 seconds. The hybrid heuristic solves 6 with fitness 0.479 and average time 189.51 seconds. The optimal method solves 6 with fitness 0.375 and average time 187.91 seconds.

Again, the practical result is not “the LLM found the mathematically best heuristic.” The result is more operationally relevant: the generated heuristic expands the set of solved instances under the time limit and runs much faster, while staying close to the optimal baseline in quality where comparison is possible.

This is exactly where business teams often live: not in the abstract kingdom of optimality, but in the swamp of deadlines, compute limits, and acceptable solution quality.

Smaller coding models can win when the prompt shows the machine

One of the paper’s most useful findings is that Qwen2.5-Coder:32b can match or outperform GPT-4o in the studied settings. This is not a generic claim that smaller models are better. It is a conditional claim: in this evolutionary heuristic-design workflow, with structured algorithmic context, a coding-oriented model performs extremely well.

That should change how managers think about AI model selection for technical automation.

The default procurement instinct is to buy the largest general model available and hope it behaves like a universal optimizer. The paper suggests a better question: does the model see the right interfaces, constraints, and evaluation feedback?

Model capacity still matters. But context design and evaluation design may matter more than many teams expect. A smaller model that sees the actual algorithmic environment may outperform a larger model that receives a more generic prompt. That is mildly inconvenient for anyone selling model size as strategy. Fortunately, inconvenience is not evidence.

The token results add a cost dimension. For Qwen2.5-Coder:32b, adding context substantially increases input tokens. On UPMP, the mean input tokens per prompt rise from 1,046 in EoH to 2,223 in PA-CEoH. On SPP, they rise from 1,477 to 3,330. Output tokens generally fall, which the authors interpret as more focused generation.

This is the practical cost of making the model less vague. You pay more input tokens to describe the machine. In return, you may get shorter, more targeted outputs and better heuristics. That is not free. It may still be cheaper than months of manual heuristic engineering, but the paper does not prove that ROI case directly. It gives the technical ingredient; the economics still need deployment-specific accounting.
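The input-token figures quoted above are easy to turn into growth factors. The snippet below only restates the reported means for Qwen2.5-Coder:32b and divides them; the dictionary layout and function name are my own.

```python
# Mean input tokens per prompt for Qwen2.5-Coder:32b, as quoted in the text.
MEAN_INPUT_TOKENS = {
    "UPMP": {"EoH": 1046, "PA-CEoH": 2223},
    "SPP": {"EoH": 1477, "PA-CEoH": 3330},
}

def growth(domain):
    """Ratio of PA-CEoH to EoH mean input tokens for one domain."""
    base = MEAN_INPUT_TOKENS[domain]["EoH"]
    full = MEAN_INPUT_TOKENS[domain]["PA-CEoH"]
    return full / base

for domain in MEAN_INPUT_TOKENS:
    print(f"{domain}: {growth(domain):.2f}x input tokens per prompt")
```

In both domains the combined context a little more than doubles the input side of each prompt. Whether that is expensive depends on what the shorter outputs and better heuristics are worth in a given deployment, which these ratios alone cannot say.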

The business value is controlled heuristic discovery, not autonomous optimization theater

For business readers, the important distinction is between three different claims:

| Level | Claim | Status |
| --- | --- | --- |
| What the paper directly shows | LLMs in an EoH-style loop can generate A*-guiding heuristics for UPMP and SPP, and algorithmic context improves results in the studied settings | Supported by the paper’s experiments |
| What Cognaptus infers | Companies using search algorithms may use LLMs to generate candidate heuristic functions inside a controlled benchmark-and-selection pipeline | Reasonable extrapolation |
| What remains uncertain | Whether the approach generalizes across other algorithms, instance distributions, production constraints, and safety requirements | Not proven by this paper |

This separation matters. The result is powerful precisely because it is not a chatbot fantasy. The LLM is not asked to manage the warehouse. It is not asked to “think strategically” about inventory. It writes candidate scoring functions. Those functions are executed, measured, ranked, and replaced if they fail.

A practical workflow would look something like this:

  1. Identify the search or planning algorithm already used in the operation.
  2. Expose the LLM to the relevant algorithmic interface, not the entire codebase.
  3. Ask it to generate candidate heuristic functions with strict input-output requirements.
  4. Evaluate each candidate on representative historical and synthetic instances.
  5. Compare against existing human-designed baselines.
  6. Keep only heuristics that improve solve rate, runtime, or quality under defined limits.
  7. Monitor performance drift when instance distributions change.

This is not glamorous. That is one reason it might work.
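Steps 4 to 6 of the workflow amount to an acceptance gate. The sketch below is one possible shape for it, with assumed names and an assumed tolerance (`max_cost_ratio`); the sample numbers are invented purely to exercise the code. It compares solution quality only on instances that both the candidate and the baseline solved, so that a candidate is not penalized for solving harder cases the baseline timed out on.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass(frozen=True)
class RunResult:
    solved: bool    # finished within the time and node limits?
    cost: float     # solution cost, lower is better (ignored if not solved)
    seconds: float  # wall-clock time

def accept(candidate, baseline, max_cost_ratio=1.15):
    """Keep a candidate heuristic only if it does not lose coverage, stays close
    in quality on commonly solved instances, and improves coverage or speed."""
    if len(candidate) != len(baseline):
        raise ValueError("results must cover the same instances in the same order")
    cand_solved = sum(r.solved for r in candidate)
    base_solved = sum(r.solved for r in baseline)
    if cand_solved < base_solved:
        return False
    common = [(c, b) for c, b in zip(candidate, baseline) if c.solved and b.solved]
    faster = False
    if common:
        cand_cost = mean(c.cost for c, _ in common)
        base_cost = mean(b.cost for _, b in common)
        if cand_cost > base_cost * max_cost_ratio:
            return False
        faster = mean(c.seconds for c, _ in common) < mean(b.seconds for _, b in common)
    return cand_solved > base_solved or faster

# Invented sample data: three instances, the baseline times out on the third.
baseline = [RunResult(True, 10, 5.0), RunResult(True, 12, 6.0), RunResult(False, 0, 60.0)]
good = [RunResult(True, 11, 0.5), RunResult(True, 12, 0.7), RunResult(True, 20, 1.0)]
sloppy = [RunResult(True, 15, 0.5), RunResult(True, 16, 0.7), RunResult(True, 20, 1.0)]

print(accept(good, baseline), accept(sloppy, baseline))
```

The first candidate is kept because it solves more instances at nearly the same cost; the second is rejected because its quality on the shared instances falls outside the tolerance. Where to set that tolerance is a business decision, not something the gate can infer.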

In logistics, scheduling, routing, warehouse automation, and resource allocation, many firms already have rule-based or search-based systems that are brittle but valuable. Replacing them wholesale with agentic improvisation would be unwise. But using LLMs to generate better components for those systems is much more plausible.

The paper’s architecture points toward that component-level AI strategy. The LLM becomes a design assistant for heuristics. The algorithm remains the executor. The benchmark is the judge.

Algorithmic context is not the same as dumping the codebase into the prompt

There is a dangerous shallow reading of this paper: “Put code in the prompt.”

No. Please do not turn this into another context-window landfill.

The authors include targeted algorithmic context: the A* driver procedure and the key methods that define goal checking, neighbor generation, path reconstruction, and objective value. This context is compact enough to show how the heuristic will be used, but structured enough to expose the algorithm’s dynamics.

That is different from pasting a whole repository into a model and hoping the transformer develops taste.

For enterprise use, the design question is: which part of the system does the model need to understand to generate a useful component? In this paper, the answer is the interface between heuristic score and search behavior. For a scheduling system, it might be the dispatch rule and constraint checker. For a routing engine, it might be the local-search move operator and feasibility repair logic. For a trading execution simulator, it might be the state transition and cost model. No, that does not mean letting a model trade live because it once saw a function signature. We remain adults.

The broader pattern is reusable: provide the model with the smallest algorithmic slice that makes its generated component operationally meaningful.

Where the result should not be overextended

The limitations are not decorative; they define how the paper should be used.

First, the evidence comes from two domains. They are well chosen, but still only two. UPMP and SPP reveal different strengths of algorithmic and problem context, but they do not cover all constrained optimization problems.

Second, the generated heuristics are evaluated under specific instance configurations, time limits, node limits, and model choices. A heuristic that performs well on a 5 × 5 single-bay UPMP layout with five priority classes and 60% fill rate may not transfer cleanly to different warehouse structures. Distribution shift is not impressed by publication tables.

Third, A* has its own theoretical subtleties. If a heuristic is admissible, A* can preserve optimality guarantees. LLM-generated heuristics may improve practical guidance without being admissible. That can be acceptable in business contexts where speed and feasibility matter, but it changes the guarantee. The paper’s strong practical results should not be casually translated into “optimal planning by LLM.”

Fourth, the approach adds token cost and evaluation cost. The evolutionary loop uses many prompts: initialization plus 20 generations, with exploration and modification strategies generating 1,600 prompts per run. The paper shows better heuristic quality, not a full production cost-benefit analysis.

Finally, human oversight remains necessary. The point of the evolutionary loop is that generated code is not trusted by default. It is tested. This may sound obvious, but in AI product discussions, obvious things often need bodyguards.
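The admissibility caveat above can be made concrete with a three-node graph. This is a textbook-style illustration, not an example from the paper: the graph, the heuristic values, and the function name are all assumptions. With a heuristic that never overestimates, A* returns the cheapest route; with one that overestimates the cost from `A`, it returns a worse route first.

```python
import heapq

# S -> A -> G costs 1 + 1 = 2 (optimal); S -> G directly costs 3.
GRAPH = {"S": [("A", 1), ("G", 3)], "A": [("G", 1)], "G": []}

def a_star_cost(h):
    """Return the cost of the first path to G that A* pops, using heuristic table h."""
    open_list = [(h["S"], 0, "S")]
    best_g = {"S": 0}
    while open_list:
        _, g, node = heapq.heappop(open_list)
        if g > best_g[node]:
            continue  # stale entry
        if node == "G":
            return g
        for nxt, weight in GRAPH[node]:
            new_g = g + weight
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(open_list, (new_g + h[nxt], new_g, nxt))
    return None

admissible = {"S": 2, "A": 1, "G": 0}     # never overestimates the remaining cost
inadmissible = {"S": 2, "A": 10, "G": 0}  # claims A is far from the goal

print(a_star_cost(admissible), a_star_cost(inadmissible))
```

The admissible table yields cost 2 and the inadmissible one yields cost 3. The second search is not broken; it simply no longer carries an optimality guarantee, which is the trade a generated heuristic may be making.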

The better takeaway: architecture beats vague intelligence

The paper’s best contribution is not that one model beat another model. Benchmark leaderboards are useful, but they age like milk.

The more durable contribution is architectural: LLMs can be useful in optimization when they are embedded into a system that constrains their role, exposes the right context, and evaluates outputs against hard instances. Algorithmic context gives the model a better understanding of how its code affects search. Problem context gives it semantic grounding. Evolutionary selection turns generation into iterative design rather than one-shot hope.

That is the mechanism-first lesson.

For businesses, the next step is not to ask whether an LLM can “solve logistics.” That question is too broad to be useful and too vague to be falsifiable. The better question is narrower:

Which small heuristic, scoring rule, ranking function, or move-selection policy inside our existing optimization system is costly to design manually, easy to evaluate automatically, and valuable if improved?

That is where this paper points. Not toward autonomous optimization theater, but toward controlled heuristic discovery.

The new heuristic is not just the function the model writes. It is the whole procedure around it: expose the algorithm, generate candidates, evaluate brutally, keep what works, and let the search engine do its job.

The model gets context. The algorithm gets discipline. The business gets a component that can be tested before anyone lets it near the warehouse.

A rare arrangement. Almost suspiciously sensible.

Cognaptus: Automate the Present, Incubate the Future.


  1. Thomas Bömer, Nico Koltermann, Max Disselnmeyer, Bastian Amberg, and Anne Meyer, “Algorithmic Prompt-Augmentation for Efficient LLM-Based Heuristic Design for A* Search,” arXiv:2601.19622, 2026. https://arxiv.org/abs/2601.19622