When AI Packs Too Much Hype: Reassessing LLM 'Discoveries' in Bin Packing

A warehouse manager, a cloud scheduler, and a container-ship planner all know the same unpleasant truth: fitting things into limited capacity is where tidy strategy goes to die.

That is why bin packing remains such a useful test case. The problem is easy to explain and difficult to solve optimally. Items arrive. Bins have fixed capacity. The objective is to use as few bins as possible. In the online version, the system must decide where to place each item as it arrives, without seeing the future. This is not just a toy puzzle. It resembles production scheduling, memory allocation, server placement, freight consolidation, and every other operational setting where tomorrow’s workload has the bad manners not to disclose itself in advance.

So when LLM-based program search was reported to have found improved heuristics for online bin packing, the claim sounded plausible enough to be exciting. Perhaps an LLM, wrapped inside an evolutionary search loop, had found a genuinely new algorithmic pattern. Perhaps code generation was beginning to look less like autocomplete and more like machine-assisted discovery. A charming thought. Also a dangerous one.

In “An In-depth Study of LLM Contributions to the Bin Packing Problem,” Julien Herrmann and Guillaume Pallez revisit that claim and ask a less glamorous question: what did the LLM-generated heuristics actually do?¹ Their answer is not that LLMs are useless for optimisation. It is sharper than that. The LLMs appear to have produced empirically effective heuristics for narrow stochastic instances, but the apparent discovery collapses into a simple behavioural rule once inspected carefully. The impressive-looking programs were not so much transparent insights as coded artefacts requiring expert reverse-engineering.

The useful lesson is not “AI cannot help with optimisation.” That would be a lazy conclusion, and laziness already has enough venture funding. The better lesson is this: an LLM-generated heuristic is not a discovery until it survives interpretation, baseline comparison, and distribution shift.

The mechanism hiding inside the code

The paper focuses on heuristics produced by LLM-based evolutionary frameworks, especially FunSearch and the later Evolution of Heuristics (EoH). FunSearch, introduced by Romera-Paredes and colleagues, combines a frozen LLM with an evaluator: the model proposes candidate programs, the system scores them, and the evolutionary loop keeps mutating the better candidates.² For bin packing, the generated programs are priority functions. Given an incoming item and the current state of available bins, the function assigns scores to candidate bins, and the item goes to the highest-scoring one.
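To make the priority-function setup concrete, here is a minimal Python sketch. It is not FunSearch's actual interface: the names `pack_online` and `best_fit_priority` are illustrative, and treating one fresh empty bin as an always-available candidate is a simplifying assumption.

```python
from typing import Callable, List, Sequence

# A priority function maps (item, remaining capacity of each candidate bin)
# to one score per candidate. Illustrative type, not FunSearch's API.
Priority = Callable[[float, Sequence[float]], Sequence[float]]


def best_fit_priority(item: float, remaining: Sequence[float]) -> List[float]:
    """BestFit written as a priority function: less leftover space scores higher."""
    return [-(r - item) for r in remaining]


def pack_online(items: Sequence[float], capacity: float, priority: Priority) -> List[float]:
    """Place items one at a time; return the remaining capacity of each opened bin.

    Assumes every item is no larger than the bin capacity.
    """
    bins: List[float] = []
    for item in items:
        # Candidates: every opened bin that can hold the item, plus one fresh bin.
        feasible = [i for i, r in enumerate(bins) if r >= item]
        candidates = [bins[i] for i in feasible] + [capacity]
        scores = priority(item, candidates)
        # max() keeps the earliest candidate on ties, i.e. the oldest bin.
        best = max(range(len(candidates)), key=lambda k: scores[k])
        if best == len(feasible):
            bins.append(capacity - item)   # the fresh bin won: open it
        else:
            bins[feasible[best]] -= item   # reuse an existing bin
    return bins
```

Swapping `best_fit_priority` for another scoring function changes the packing policy without touching the loop, which is the property an evolutionary search over priority functions relies on.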

That setup matters because the generated code is not a theorem. It is a decision rule discovered through search. The paper’s central move is to treat the code not as sacred machine wisdom but as an object of behavioural analysis: when does it open a new bin, when does it reuse an old one, and what hidden rule explains its choices?

The first target is the FunSearch heuristic called c12, trained for item sizes drawn from Uniform(20, 100) with bin capacity 150. At first glance, c12 looks like a list of hand-written thresholds. It assigns high priority to bins where the item would leave very little residual space, and lower priority elsewhere.

After inspection, Herrmann and Pallez show that c12 effectively behaves as a two-zone rule:

If an item can be placed so that the remaining capacity is very small, use a tight-fit strategy similar to BestFit.
If not, avoid bins that would leave a mid-sized leftover gap, and instead use a FirstFit-like choice among bins that remain sufficiently open.

In the paper’s analysis, c12 prioritises bins that would leave at most 7 units of remaining space. If none exist, it selects the first bin that would leave more than 21 units. Bins that would leave between 7 and 21 are essentially skipped. Several lines of the original c12 code are therefore not operationally useful. They look like logic. They behave like decorative plumbing.

That is the first crack in the discovery narrative. The value is not in the code’s surface complexity. The value, if any, is in the behavioural pattern: close bins tightly when possible, otherwise avoid creating unusable leftover space.
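That behavioural pattern can be written down in a few lines. The sketch below is a reconstruction from the thresholds quoted above (7 and 21), not the original c12 code; the function name, the tie-breaking, and the exact inequalities are assumptions.

```python
from typing import Optional, Sequence

TIGHT = 7   # leftover at or below this counts as a tight fit (value quoted above)
OPEN = 21   # leftover above this counts as still sufficiently open


def choose_bin_two_zone(item: float, remaining: Sequence[float]) -> Optional[int]:
    """Return the index of the chosen bin, or None to open a new bin."""
    # Leftover space each feasible bin would have after taking the item.
    leftovers = [(i, r - item) for i, r in enumerate(remaining) if r >= item]

    # Zone 1: BestFit-like choice among tight fits (smallest leftover wins).
    tight = [(left, i) for i, left in leftovers if left <= TIGHT]
    if tight:
        return min(tight)[1]

    # Zone 2: FirstFit-like choice among bins that stay sufficiently open.
    for i, left in leftovers:
        if left > OPEN:
            return i

    # Only mid-sized gaps (or nothing) available: skip them.
    return None
```

For example, with an item of size 50 and bins holding 60, 65, 80 and 90 units of free space, the first two bins would be left with gaps of 10 and 15, so the rule skips them and picks the third.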

The authors then define a smoothed version of c12 and later generalise it into a family of two-parameter heuristics (ab-FirstFit and, for the Weibull case, ab-WorstFit). The broad rule is:

allocate the item to the best-fitting bin if the fit is tight enough, controlled by threshold $a$;
otherwise allocate it only to bins with enough slack beyond the item, controlled by threshold $b$;
if neither condition holds, open a new bin.
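A compact packer following the three steps above might look like this. It is a sketch under assumptions rather than the authors' implementation: the name `ab_first_fit`, the choice of "leftover at most $a$" and "leftover greater than $b$" as the two tests, and the FirstFit order for the second step are all illustrative.

```python
from typing import List, Sequence


def ab_first_fit(items: Sequence[float], capacity: float, a: float, b: float) -> int:
    """Pack items online with a two-threshold rule; return the number of bins used."""
    bins: List[float] = []  # remaining capacity per opened bin, in opening order
    for item in items:
        tight_idx, tight_left = -1, None  # best tight fit seen so far
        first_open = -1                   # earliest bin with enough slack
        for i, r in enumerate(bins):
            left = r - item
            if left < 0:
                continue  # item does not fit in this bin
            if left <= a and (tight_left is None or left < tight_left):
                tight_idx, tight_left = i, left
            if left > b and first_open < 0:
                first_open = i
        target = tight_idx if tight_idx >= 0 else first_open
        if target >= 0:
            bins[target] -= item
        else:
            bins.append(capacity - item)  # neither condition holds: open a new bin
    return len(bins)
```

With capacity 150, `a=5` and `b=24`, the sequence `[100, 30]` uses two bins: placing the 30 in the first bin would leave a gap of 20, which is neither tight nor roomy, so the rule opens a new bin instead.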

For the Uniform case, their simplified ab-FirstFit reaches essentially the same performance as c12. In the reported setting, c12 improves over BestFit by about 2.0%, while ab-FirstFit with tuned thresholds improves by about 2.1%. This is not a rounding error in intellectual history. It is the point.

The Weibull heuristic is harder to read, not necessarily deeper

The second FunSearch heuristic, c14, targets item sizes drawn from Weibull(3.0, 45) with bin capacity 100. This one is much less interpretable. Its score function includes algebraic terms involving bin capacity and item size, then modifies scores based on adjacent bins in implementation order. That last move is especially opaque. A bin’s priority depends partly on the score of its neighbour. Somewhere, an operations researcher quietly reaches for coffee.

The paper’s analysis of c14 is important because it separates readability from interpretability. Yes, c14 is code. Yes, a human can read it. But reading code is not the same as understanding the operational principle. Herrmann and Pallez explicitly say their understanding of c14 came through experimental observation, not from the formula itself. They compare c14 item by item against WorstFit, examine where the two diverge, and infer the behaviour from those differences.

The resulting interpretation is again a threshold rule. In simplified terms:

if a perfect-fit bin exists, use it;
otherwise, if there is a bin whose remaining space exceeds the item size by more than roughly 20, use a WorstFit-like choice;
otherwise open a new bin.

This is not identical to c14, and the paper is careful about that. Unlike the c12 case, c14’s simplified interpretation differs non-trivially from the actual program. Still, the behavioural pattern is recognisable. It is another version of the same idea: only close a bin when the fit is very tight; otherwise preserve useful space for future items.
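The simplified interpretation, as opposed to c14 itself, can be sketched as a single decision function. The name `choose_bin_c14_like`, the default slack of 20, and the assumption of integer item sizes (so that a perfect fit can be tested with equality) are illustrative choices, not details taken from the paper.

```python
from typing import Optional, Sequence


def choose_bin_c14_like(item: int, remaining: Sequence[int], slack: int = 20) -> Optional[int]:
    """Return the index of the chosen bin, or None to open a new bin.

    Approximates the threshold rule described above; it is not the c14 program.
    """
    roomiest_idx: Optional[int] = None
    roomiest_left = slack  # a bin qualifies only if its leftover exceeds the slack
    for i, r in enumerate(remaining):
        left = r - item
        if left == 0:
            return i  # a perfect fit always wins
        if left > roomiest_left:
            # WorstFit-like: remember the bin that would keep the most space.
            roomiest_idx, roomiest_left = i, left
    return roomiest_idx
```

Bins that would be left with a gap between 1 and the slack value are never chosen, which is the "do not create awkward leftovers" behaviour the authors inferred from their item-by-item comparison.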

The later EoH heuristic for the same Weibull distribution uses a different score function but, according to the authors, captures essentially the same underlying principle. So the mystery becomes smaller. Different LLM-evolved formulas, same basic behavioural family. The machine did not hand over a compact conceptual insight. It handed over several ornate routes to a rule humans could express more cleanly after analysis.

What the experiments are actually doing

The paper’s experiments are best read as a staged diagnostic, not as one undifferentiated benchmark table. That distinction matters because different tests support different claims.

| Evidence in the paper | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| c12 threshold analysis and Figure 1 | Mechanism extraction | c12 can be reduced to a tight-fit / skip-gap / FirstFit-like behaviour | That this rule is universally good for online bin packing |
| c12 vs Smooth c12 on Uniform(20,100) | Main comparison | The simplified version closely matches c12’s performance | That the LLM discovered the simplification explicitly |
| Performance as item count increases | Sensitivity test | The advantage depends strongly on large numbers of items | Robust gains on small operational workloads |
| c14 vs WorstFit item-by-item comparison | Behavioural diagnosis | c14 often opens a new bin where WorstFit would reuse one, revealing threshold-like behaviour | That c14 is easily interpretable from its code |
| ab-FirstFit and ab-WorstFit heatmaps | Parameter robustness / tuning evidence | Several threshold choices work comparably well in the studied distributions | Parameter-free generalisation |
| Cross-distribution comparison | Robustness and comparison with prior work | LLM-evolved heuristics are brittle; ab-style baselines generalise the extracted idea better | Guaranteed superiority outside distributions with suitable lower-bound structure |
| Laptop-time tuning versus LLM search | Implementation and cost evidence | The simpler heuristics are far cheaper to evaluate and tune | That all LLM-based heuristic search is inefficient in every setting |

For Uniform(20,100), the best reported ab-FirstFit setting is $a=5$, $b=24$, giving a relative ratio of 0.979 against BestFit. In ordinary language: about 2.1% fewer bins than BestFit in that setting. c12 gives about 2.0%.

For Weibull(3.0,45), the simplified ab-WorstFit family reaches roughly the same improvement as c14. The figure caption reports best parameters around $a=1$, $b=21$ with a relative ratio of 0.967; the surrounding text reports $b=22$ and an average 3.3% gain. The exact threshold is not the business lesson. The lesson is that a two-parameter interpretable heuristic matches the LLM-evolved result.
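The percentages quoted here follow directly from the relative ratios, assuming the ratio is defined as bins used by the heuristic divided by bins used by BestFit (the reading the 2.1% figure implies):

$$
\text{saving} = 1 - \frac{\text{bins}_{\text{heuristic}}}{\text{bins}_{\text{BestFit}}}, \qquad 1 - 0.979 = 0.021 \approx 2.1\%, \qquad 1 - 0.967 = 0.033 \approx 3.3\%.
$$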

The authors also report that EoH improves over BestFit by about 3.2% on the same Weibull setting, while c14 and ab-WorstFit are around 3.3%. Again, the trophy goes not to mystical code but to the boring thing that actually matters in operations: a simpler rule with comparable performance and lower cost.

The real dependency is distribution structure

The paper does not say the threshold rule is magic. It says the rule works under particular structural conditions.

The first condition is scale. The advantage appears when many items are scheduled. In the Uniform case, BestFit can outperform c12 when the number of items is small; the paper notes that below roughly 90 items, c12’s average performance is worse than BestFit. The same item-count dependency appears in the Weibull case, where the performance gap is even more pronounced for small numbers of items.
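The item-count dependency is easy to probe with a small simulation. The sketch below is not the paper's experimental code and will not reproduce its exact figures: it assumes integer item sizes drawn uniformly from 20 to 100, capacity 150, and the thresholds `a=5`, `b=24` mentioned in this article; the function names are illustrative.

```python
import random
from typing import List, Sequence

CAPACITY = 150  # bin capacity of the Uniform setting discussed above


def best_fit(items: Sequence[int], capacity: int = CAPACITY) -> int:
    """Classic BestFit: put the item in the feasible bin with the least free space."""
    bins: List[int] = []
    for item in items:
        fits = [i for i, r in enumerate(bins) if r >= item]
        if fits:
            bins[min(fits, key=lambda k: bins[k])] -= item
        else:
            bins.append(capacity - item)
    return len(bins)


def ab_first_fit(items: Sequence[int], a: int = 5, b: int = 24,
                 capacity: int = CAPACITY) -> int:
    """Two-threshold rule: tight fit first, else first roomy bin, else a new bin."""
    bins: List[int] = []
    for item in items:
        fits = [i for i, r in enumerate(bins) if r >= item]
        tight = [i for i in fits if bins[i] - item <= a]
        roomy = [i for i in fits if bins[i] - item > b]
        if tight:
            bins[min(tight, key=lambda k: bins[k])] -= item
        elif roomy:
            bins[roomy[0]] -= item
        else:
            bins.append(capacity - item)
    return len(bins)


def mean_ratio(n_items: int, trials: int = 5, seed: int = 0) -> float:
    """Average of (ab_first_fit bins / best_fit bins) over random instances."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        items = [rng.randint(20, 100) for _ in range(n_items)]
        total += ab_first_fit(items) / best_fit(items)
    return total / trials


if __name__ == "__main__":
    # A ratio below 1.0 means the threshold rule used fewer bins than BestFit.
    for n in (50, 100, 500, 1000):
        print(n, round(mean_ratio(n), 4))
```

Running the sweep over several values of `n_items` shows whether the ratio drifts below 1.0 only as instances grow, which is the pattern the paper reports.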

The second condition is the lower bound or effective scarcity of very small items. In Uniform(20,100), no item is smaller than 20. If a placement leaves less than 20 units in a bin, that leftover space is unusable for any future item. Closing the bin tightly is therefore sensible. Leaving a middling gap can be wasteful because the gap may be too small for the next item but too large to ignore. The threshold rule exploits this structure.

For Weibull, the distribution is not bounded below in the same simple way, but the studied parameterisation contains relatively few very small items. The same intuition still helps: do not create leftover spaces that the future workload is unlikely to use.

This is where the “mathematical discovery” framing weakens. The result is less about a new general principle for online bin packing and more about exploiting known or learnable properties of a specific stochastic instance. The authors point out that these are not the canonical worst-case online bin-packing problem. They are narrow stochastic online bin-packing settings with known distributions. Prior literature had studied some uniform average-case settings, but not exactly these parameterisations, especially not the positive lower-bound Uniform case or the non-uniform Weibull case explored by FunSearch.

That absence of prior work is ambiguous. It could mean the LLM opened a neglected research path. It could also mean the instance was not previously considered very important. The paper leans toward the latter interpretation: the attention may have followed the AI-discovery narrative rather than the intrinsic significance of the bin-packing instance. A slightly cruel possibility, but not an irrational one.

Readable code is not the same as interpretable knowledge

One of the paper’s most useful distinctions is between human-readable output and human-interpretable insight.

Readable code is code a person can inspect. Interpretable code is code whose behaviour and rationale can be efficiently understood, predicted, and reused. These are different standards. The gap between them is exactly where many AI product claims become slippery.

c12 is relatively interpretable because its threshold structure is close to the surface. c14 is not. The authors had to conduct multiple rounds of behavioural experiments to infer what c14 was doing. They describe the crucial threshold around 20 or 21 as not being apparent in the code. That means the analysis resembled black-box reverse engineering, despite the output being source code.

For business users of AI-generated optimisation, this distinction is not academic pedantry. It determines whether a generated heuristic can be governed.

A logistics team can deploy a rule it understands. It can explain why the rule behaves differently when order sizes shift. It can tune thresholds when customer mix changes. It can audit bad outcomes. But a strange generated formula that happens to work on a benchmark is operational debt wearing a lab coat.

The same applies to cloud scheduling, procurement batching, route assignment, workforce rostering, and any setting where optimisation logic affects cost, service level, or compliance. If an LLM-generated heuristic cannot be reduced into a testable operational principle, the organisation has not acquired knowledge. It has acquired behaviour.

Behaviour can be useful. It can also break silently.

What this means for business AI teams

The practical takeaway is not that companies should avoid LLM-assisted optimisation. In fact, the paper implicitly supports one version of the workflow: let the LLM search process generate candidate behaviours, then have domain experts analyse, simplify, test, and turn them into reusable principles. That can be valuable. It is also not the same as autonomous discovery.

A useful business pathway looks like this:

| Stage | What the AI system provides | What humans must still do | Failure mode if skipped |
| --- | --- | --- | --- |
| Candidate generation | Many possible heuristics or priority functions | Define evaluation conditions and operational constraints | Optimising for the wrong benchmark |
| Behavioural diagnosis | Empirical evidence that one candidate performs well | Identify the mechanism behind the gain | Treating local curve-fitting as insight |
| Baseline comparison | Performance against standard heuristics | Add domain-specific and simplified baselines | Overpaying for complexity |
| Distribution-shift testing | Results under changed workload assumptions | Decide where the rule is safe to deploy | Silent failure under new demand mix |
| Operationalisation | A rule or model embedded in workflow | Monitor, retune, and govern decisions | Algorithmic debt |
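The distribution-shift stage in the table above can be made concrete with a small harness that scores any candidate packer against a baseline across several workloads. Everything here is a hypothetical sketch: the scenario list, the integer rounding and clipping of Weibull samples, and the use of FirstFit as a stand-in candidate are assumptions chosen to keep the example short.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Packer = Callable[[Sequence[int], int], int]   # (items, capacity) -> bins used
Sampler = Callable[[random.Random], int]       # draws one item size


def first_fit(items: Sequence[int], capacity: int) -> int:
    """Stand-in candidate: put each item in the earliest bin that can hold it."""
    bins: List[int] = []
    for item in items:
        for i, r in enumerate(bins):
            if r >= item:
                bins[i] -= item
                break
        else:
            bins.append(capacity - item)
    return len(bins)


def best_fit(items: Sequence[int], capacity: int) -> int:
    """Baseline: put each item in the feasible bin with the least free space."""
    bins: List[int] = []
    for item in items:
        fits = [i for i, r in enumerate(bins) if r >= item]
        if fits:
            bins[min(fits, key=lambda k: bins[k])] -= item
        else:
            bins.append(capacity - item)
    return len(bins)


# Hypothetical workloads: one "training" distribution and two shifted ones.
SCENARIOS: Dict[str, Tuple[Sampler, int]] = {
    "Uniform(20,100), capacity 150": (lambda rng: rng.randint(20, 100), 150),
    "Uniform(1,100), capacity 150": (lambda rng: rng.randint(1, 100), 150),
    # random.weibullvariate takes (scale, shape); sizes are rounded and clipped.
    "Weibull(3.0,45), capacity 100": (
        lambda rng: max(1, min(100, round(rng.weibullvariate(45, 3.0)))), 100),
}


def shift_report(candidate: Packer, baseline: Packer, n_items: int = 500,
                 trials: int = 5, seed: int = 0) -> Dict[str, float]:
    """Mean ratio of candidate bins to baseline bins, per scenario."""
    report: Dict[str, float] = {}
    for name, (sample, capacity) in SCENARIOS.items():
        rng = random.Random(seed)
        ratios = []
        for _ in range(trials):
            items = [sample(rng) for _ in range(n_items)]
            ratios.append(candidate(items, capacity) / baseline(items, capacity))
        report[name] = sum(ratios) / trials
    return report


if __name__ == "__main__":
    for name, ratio in shift_report(first_fit, best_fit).items():
        print(f"{name}: {ratio:.4f}")
```

A candidate whose ratio is below 1.0 on the first scenario but not on the others is a rule tuned to one workload, which is the failure mode the table warns about.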

The ROI relevance is also straightforward. FunSearch-style approaches can require thousands or millions of LLM calls and days of search. The paper reports that its own experimental runs, including meta-tuning the ab-style baselines, can be completed in minutes on a personal laptop. That cost contrast does not prove LLM search is never worth it. It proves that once a simple mechanism is available, continuing to worship the expensive generator is not strategy. It is procurement theatre.

For an enterprise, the central question is not “Did the AI beat BestFit by 3%?” The better questions are:

Does the gain survive outside the training distribution?
Is the rule simpler than the generated code?
Can the business explain when the rule should not be used?
Are we paying for discovery, or merely for a noisy search process that points experts toward a conventional idea?

The last question is uncomfortable because the answer may be “the latter, and that is still useful.” A noisy search process can be commercially valuable if it shortens exploration. But it should be priced and governed as search, not as insight.

The boundary of the paper’s criticism

Herrmann and Pallez are not offering a universal verdict on LLMs in science. Their critique is narrower and therefore stronger. They examine specific LLM-evolved heuristics for specific online bin-packing instances. The evidence supports a precise claim: in these cases, the generated programs do not provide sufficient evidence of mathematical discovery.

The paper’s boundary matters. It does not show that LLM-assisted evolutionary search cannot produce useful heuristics. It does not show that LLMs cannot contribute to scientific work. It does not even show that FunSearch is a bad technical idea. The authors explicitly acknowledge that extending genetic algorithm search operators with LLMs is original and worth investigating.

What it does show is that performance improvement alone is a weak basis for grand claims. A generated heuristic can outperform a standard baseline and still be narrow, opaque, overfit, or conceptually uninteresting. A benchmark win is evidence of performance under conditions. It is not automatically evidence of understanding.

The paper also points to a publication and perception problem. Subsequent LLM-evolution papers improved or modified the framework and compared against similar Weibull bin-packing benchmarks, but did not claim new mathematical discovery when surpassing c14. That asymmetry suggests the label “mathematical discovery” may have done more rhetorical work than scientific work. How surprising. A dramatic title travels further than a careful one; academia discovers marketing, again.

From discovery theatre to validation discipline

The deeper business relevance is not bin packing. It is validation culture.

LLMs are increasingly being used to generate code, policies, plans, workflows, test cases, trading strategies, and optimisation heuristics. Many of these outputs will look plausible. Some will perform well in local tests. A few may even be genuinely useful. The risk is that organisations confuse three different things:

1. A generated artefact: the model produced something executable.
2. An empirical improvement: the artefact performed better under a benchmark.
3. A transferable insight: the artefact reveals a mechanism that generalises.

Many AI adoption errors come from treating item 2 as if it were item 3.

The bin-packing case gives a cleaner standard. If an LLM-generated rule works, reverse-engineer it. Strip it down. Compare it to strong human baselines. Test it under distribution shift. Ask whether a simpler version performs just as well. If the answer is yes, deploy the simpler rule and thank the LLM for its service. No statue required.

That is not anti-AI. It is pro-usefulness.

The bin is not the breakthrough

The most interesting result in Herrmann and Pallez’s paper is not that the LLM-generated heuristics can be beaten or matched. It is that their apparent sophistication dissolves into a simple operational mechanism: use tight fits when they are available, avoid mediocre leftover gaps when the distribution makes those gaps unlikely to be useful, and rely on scale for the advantage to emerge.

That is a sensible heuristic. It is also not a revolution.

For businesses, the message is refreshingly unsentimental. LLMs can be productive search companions. They can explore code spaces that humans would not manually enumerate. They can generate candidates worth inspecting. But the moment a company turns candidate generation into “AI discovered a new principle,” it has left engineering and entered theatre.

The better future is not AI replacing optimisation expertise. It is AI producing rough artefacts that experts can compress into cleaner mechanisms, cheaper rules, and more robust systems. The hard part remains what it has always been: knowing what problem you are solving, what evidence counts, and when a local improvement deserves a promotion to general knowledge.

In bin packing, as in business, not every full-looking box contains something valuable.

Cognaptus: Automate the Present, Incubate the Future.

¹ Julien Herrmann and Guillaume Pallez, “An In-depth Study of LLM Contributions to the Bin Packing Problem,” arXiv:2510.27353, 2025.
² Bernardino Romera-Paredes et al., “Mathematical discoveries from program search with large language models,” Nature, 2024.
