Opening — Why this matters now

The academic world has been buzzing ever since a Nature paper claimed that large language models (LLMs) had made “mathematical discoveries.” Specifically, through a method called FunSearch, LLMs were said to have evolved novel heuristics for the classic bin packing problem—an NP-hard optimization task as old as modern computer science itself. The headlines were irresistible: AI discovers new math. But as with many shiny claims, the real question is whether the substance matches the spectacle.

In October 2025, a new paper by Herrmann and Pallez dissected these claims with surgical precision. Their findings are a reminder that “interpretable code” isn’t the same as “scientific insight,” and that LLMs—like overzealous interns—can sometimes produce plausible-looking work that collapses under scrutiny.

Background — From heuristics to hype

The bin packing problem asks a simple question: given items of varying sizes, how can you fit them into the smallest number of fixed-size bins? It’s a staple of logistics, cloud resource allocation, and scheduling. For decades, researchers have developed heuristics—clever rules that yield good (if not optimal) results. The classics are “First Fit,” “Best Fit,” and “Worst Fit,” each a simple decision strategy.
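To make those rules concrete, here is a minimal, illustrative Python sketch of First Fit and Best Fit for the online setting, where items arrive one at a time. It is a plain rendering of the textbook rules, not code from either paper:

```python
def first_fit(items, capacity):
    """Put each item into the first open bin that still has room."""
    bins = []  # remaining capacity of each open bin
    for item in items:
        for i, remaining in enumerate(bins):
            if item <= remaining:
                bins[i] -= item
                break
        else:
            bins.append(capacity - item)  # nothing fit: open a new bin
    return len(bins)


def best_fit(items, capacity):
    """Put each item into the feasible open bin that leaves the least slack."""
    bins = []
    for item in items:
        feasible = [i for i, remaining in enumerate(bins) if item <= remaining]
        if feasible:
            i = min(feasible, key=lambda i: bins[i] - item)
            bins[i] -= item
        else:
            bins.append(capacity - item)
    return len(bins)


items = [60, 50, 90, 20, 80]
print(first_fit(items, capacity=150), best_fit(items, capacity=150))
# Both print 3, although the optimum here is 2 bins (60+90 and 50+20+80).
```

Worst Fit is the mirror image of Best Fit: it places each item in the feasible bin with the most remaining room.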

FunSearch, introduced in Nature (Romera-Paredes et al., 2023), combined an LLM with an evolutionary search algorithm—a genetic loop that mutates and selects better heuristics over time. The model generated code snippets that were then tested, scored, and refined. One such snippet, codenamed c12, seemed to outperform traditional heuristics under certain simulated conditions. Cue the headlines about AI-led discovery.
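FunSearch's real pipeline is far more elaborate, but the shape of the loop can be sketched in a few lines. In the toy version below, a random perturbation of a numeric threshold stands in for the step where the real system asks an LLM to propose a new program; the surrounding generate, evaluate, and select scaffolding is the part the analogy is meant to show:

```python
import random

def thresholded_first_fit(items, capacity, threshold):
    """Toy candidate 'program': First Fit, but only accept a bin if the leftover
    space is exactly zero or at least `threshold` (purely illustrative)."""
    bins = []
    for item in items:
        for i, remaining in enumerate(bins):
            slack = remaining - item
            if slack == 0 or slack >= threshold:
                bins[i] = slack
                break
        else:
            bins.append(capacity - item)
    return len(bins)

def score(threshold, instances, capacity):
    """Total bins used across training instances (lower is better)."""
    return sum(thresholded_first_fit(inst, capacity, threshold) for inst in instances)

random.seed(0)
capacity = 150
instances = [[random.randint(20, 100) for _ in range(300)] for _ in range(10)]

population = [(score(0.0, instances, capacity), 0.0)]   # seed: plain First Fit
for _ in range(100):
    _, parent = random.choice(population)
    child = max(0.0, parent + random.gauss(0, 5))       # stand-in for "ask the LLM to mutate the program"
    population.append((score(child, instances, capacity), child))
    population = sorted(population)[:5]                 # select: keep the best-scoring candidates

print("best threshold found:", round(population[0][1], 2))
```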

But Herrmann and Pallez weren’t convinced. They analyzed these so-called discoveries line by line—and found something both mundane and profound.

Analysis — What the paper actually does

Herrmann and Pallez began by reviewing the code FunSearch generated for two test cases:

  1. Uniform(20,100) distribution, bin capacity = 150 (heuristic c12)
  2. Weibull(3.0,45) distribution, bin capacity = 100 (heuristic c14)

Both instances were highly specific—so specific, in fact, that they’d never been studied before. That should have been a red flag: “discovering” a pattern in an unexamined niche doesn’t make it a new theorem.
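For reference, both settings are easy to reproduce with the Python standard library. The sketch below reads Weibull(3.0,45) as shape 3 and scale 45, and rounds sizes to integers clipped to the bin capacity; those conventions are assumptions made for illustration, not specifications taken from either paper:

```python
import random
random.seed(0)

# Setting 1: item sizes ~ Uniform(20, 100), bin capacity 150 (the c12 case)
uniform_items = [random.randint(20, 100) for _ in range(5000)]

# Setting 2: item sizes ~ Weibull, read here as shape 3 / scale 45, bin capacity 100 (the c14 case).
# random.weibullvariate(alpha, beta) takes alpha = scale and beta = shape.
weibull_items = [max(1, min(100, round(random.weibullvariate(45, 3)))) for _ in range(5000)]

print(sum(uniform_items) / 5000, sum(weibull_items) / 5000)  # average item sizes
```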

They found that c12, despite its mystique, boils down to a two-threshold rule: fill bins tightly when possible, otherwise leave a comfortable gap. Once simplified, this rule can be written as a straightforward two-parameter heuristic, which they dubbed ab-FirstFit. The performance? Nearly identical—but with greater interpretability and vastly less computational cost.
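The paper's exact definition of ab-FirstFit isn't reproduced here, but a two-threshold rule of the kind described (fill tightly when possible, otherwise leave a comfortable gap) can be written in about a dozen lines of Python. Treat the parameters a and b, and the precise acceptance condition, as one plausible reading rather than the authors' code:

```python
def ab_first_fit(items, capacity, a, b):
    """Two-threshold First Fit variant (an illustrative guess, not the paper's definition):
    place an item in the first bin where it either fits tightly (slack <= a)
    or leaves a comfortable gap (slack >= b); otherwise open a new bin."""
    bins = []  # remaining capacity of each open bin
    for item in items:
        for i, remaining in enumerate(bins):
            slack = remaining - item
            if slack >= 0 and (slack <= a or slack >= b):
                bins[i] = slack
                break
        else:
            bins.append(capacity - item)
    return len(bins)
```

With a = 0 and b = 0 this collapses back to plain First Fit, so the family contains the classic baseline as a special case.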

c14, the so-called “Weibull breakthrough,” was even murkier. It contained opaque code lines—subtractions between adjacent bins, unexplained exponents—that confused even domain experts. After several rounds of empirical testing, Herrmann and Pallez reverse-engineered its essence: the same logic as c12, just with noisier algebra. The pattern, not the program, was familiar.

| Heuristic | Distribution | Principle | Interpretable | Avg. gain vs. Best Fit |
|---|---|---|---|---|
| c12 (FunSearch) | Uniform(20,100) | Tight fit + skip mid-range bins | Yes | ~2.0% |
| c14 (FunSearch) | Weibull(3.0,45) | Same as c12 with extra noise | Barely | ~3.3% |
| ab-FirstFit | Generalized | Simplified from c12 | Yes | ~2.1% |
| ab-WorstFit | Generalized | Simplified from c14 | Yes | ~3.3% |

In essence, the “AI discoveries” could be replicated by a human in under an hour—without a supercomputer or millions of LLM queries.

Findings — From black-box magic to simple math

Herrmann and Pallez introduced ab-Baselines, a family of two-parameter heuristics that generalize the behavior of c12 and c14. These simple algorithms match or exceed the performance of the LLM-generated ones across multiple distributions—without any hallucinated complexity. Even more striking, the full tuning and testing process could be done in minutes on a laptop, compared to FunSearch’s multi-day, million-query ordeal.
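To give a sense of scale, a brute-force grid search over a two-parameter family of this kind runs comfortably on a laptop. The sketch below redefines the illustrative two-threshold rule from earlier (again an assumed form, not the paper's code) and tunes it on synthetic Weibull-style instances; on a typical machine it finishes in well under the multi-day budget quoted for FunSearch:

```python
import itertools
import random
import time

def ab_first_fit(items, capacity, a, b):
    """Illustrative two-threshold rule (assumed form, not the paper's code)."""
    bins = []
    for item in items:
        for i, remaining in enumerate(bins):
            slack = remaining - item
            if slack >= 0 and (slack <= a or slack >= b):
                bins[i] = slack
                break
        else:
            bins.append(capacity - item)
    return len(bins)

random.seed(1)
capacity = 100
# 10 training instances of 500 items each, Weibull-style sizes clipped to [1, capacity]
instances = [[max(1, min(capacity, round(random.weibullvariate(45, 3)))) for _ in range(500)]
             for _ in range(10)]

start = time.time()
grid = itertools.product(range(0, 21, 2), range(20, 81, 5))   # candidate (a, b) pairs
best_a, best_b = min(grid, key=lambda ab: sum(ab_first_fit(inst, capacity, *ab)
                                              for inst in instances))
print(f"best (a, b) = ({best_a}, {best_b}) found in {time.time() - start:.1f}s")
```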

The authors highlight two key behavioral hypotheses explaining the small performance bump:

  • (H1): The advantage emerges mainly when the dataset includes many items (large n), where the law of large numbers smooths randomness.
  • (H2): The lower bound on item sizes biases the packing process: once a bin’s remaining space drops below that bound it can accept nothing more, so bins close faster and efficiency improves.

Both effects are basic statistical properties, not deep algorithmic revelations.
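Neither hypothesis needs heavy machinery to probe. The harness below compares plain Best Fit against the illustrative two-threshold rule from earlier at several instance sizes n; the thresholds are arbitrary and untuned, so whatever gap it prints reflects this toy setup, not a result from the paper:

```python
import random

def best_fit(items, capacity):
    bins = []
    for item in items:
        feasible = [i for i, r in enumerate(bins) if item <= r]
        if feasible:
            i = min(feasible, key=lambda i: bins[i] - item)
            bins[i] -= item
        else:
            bins.append(capacity - item)
    return len(bins)

def ab_first_fit(items, capacity, a=5, b=40):
    """Same illustrative two-threshold rule as above, with arbitrary defaults."""
    bins = []
    for item in items:
        for i, remaining in enumerate(bins):
            slack = remaining - item
            if slack >= 0 and (slack <= a or slack >= b):
                bins[i] = slack
                break
        else:
            bins.append(capacity - item)
    return len(bins)

random.seed(2)
capacity = 150
for n in (100, 1000, 5000):  # probe H1: does any gap grow or stabilize with n?
    items = [random.randint(20, 100) for _ in range(n)]
    bf, ab = best_fit(items, capacity), ab_first_fit(items, capacity)
    print(f"n={n:5d}  BestFit bins={bf:5d}  two-threshold bins={ab:5d}  gap={(bf - ab) / bf:+.2%}")
```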

Implications — What this means for AI research

The study delivers a sobering message: empirical optimization is not mathematical discovery. The LLM’s heuristics worked—but for trivial reasons and under trivial conditions. Worse, their “interpretability” was overstated: human-readable code isn’t the same as human-understandable reasoning.

This distinction matters. In an era where “AI-assisted discovery” is becoming a buzzword, researchers risk mistaking local curve-fitting for conceptual progress. Herrmann and Pallez argue that the true test of discovery is generalization: can the insight survive a change of context? For FunSearch’s bin packing case, the answer is no.

| Criterion | LLM-Evolved Heuristics | Human-Designed ab-Baselines |
|---|---|---|
| Interpretability | Low–Medium | High |
| Computational cost | Extremely high (millions of queries) | Low (minutes on a laptop) |
| Generalization | Poor | Moderate–Good |
| Conceptual novelty | Minimal | None claimed |

In short, the LLMs didn’t discover anything. They merely rephrased an old idea—loudly.

Conclusion — Packing away the illusions

This episode underscores a critical point for the AI community: performance gains are not the same as discoveries, and code generation is not comprehension. FunSearch was a technological demonstration, not a mathematical revolution. What Herrmann and Pallez’s paper elegantly shows is that the frontier of AI-assisted science still lies in understanding, not automating.

The hype cycle will move on, but the lesson remains: if AI is to make genuine contributions to scientific reasoning, it must move from mimicry to mastery—from fitting bins to fitting theories.

Cognaptus: Automate the Present, Incubate the Future.