Opening — Why this matters now
The current generation of LLM-powered systems can write code, suggest optimizations, and even debug their own outputs. Impressive, yes—but fundamentally limited. Most of these systems are still operating at the function level, not the system level.
That distinction matters more than people admit.
In real-world optimization—logistics, routing, scheduling, portfolio construction—the performance edge rarely comes from a clever function. It comes from how the entire algorithm is structured, decomposed, and coordinated. And until recently, that remained stubbornly human territory.
The paper behind this analysis introduces a system called BEAM (Bi-level Memory-adaptive Algorithmic Evolution). Its ambition is simple to state and difficult to execute: move LLMs from code generators to algorithm designers.
Sensibly, it doesn't try to do this in one step.
Background — From Prompt Engineering to Algorithm Design
The evolution of LLM-based optimization has followed a familiar arc:
| Stage | Capability | Limitation |
|---|---|---|
| Prompt Engineering | Generates code snippets | No feedback loop |
| LLM Agents | Iterative refinement | Weak causal understanding |
| Language Hyper-Heuristics (LHH) | Evolves heuristics | Stuck at single-function level |
Most existing LHH frameworks treat an algorithm as a single evolving object. That sounds elegant—until you ask it to design something complex.
What happens instead?
- Code degenerates into trivial variations
- Evolution stalls after a few iterations
- Improvements come from tweaking small functions rather than redesigning the system
In other words: local optimization pretending to be global intelligence.
The authors identify two structural problems:
- Flat search space — no separation between architecture and implementation
- Knowledge blindness — either no external knowledge or rigid templates
Humans don’t design algorithms this way. We:
- Sketch the structure first
- Fill in components later
- Reuse known patterns aggressively
BEAM simply formalizes that intuition.
Analysis — The BEAM Framework (Where Things Get Interesting)
BEAM reframes algorithm design as a bi-level optimization problem:
- Outer layer (Structure) → What kind of algorithm are we building?
- Inner layer (Functions) → How do individual components behave?
This decomposition is not cosmetic—it fundamentally changes the search dynamics.
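The two-layer loop can be sketched in a few lines. This is a minimal illustration of the bi-level idea, not BEAM's actual code: the names `propose_structure`, `realize_functions`, and the toy scoring are all assumptions for exposition.

```python
import random

def propose_structure(rng):
    # Outer layer: choose a high-level pipeline (an ordered list of module slots).
    modules = ["construct", "local_search", "perturb"]
    return rng.sample(modules, k=rng.randint(1, len(modules)))

def realize_functions(structure, rng):
    # Inner layer: bind each slot to a concrete implementation.
    # Here a stub: each binding is just a quality score in [0, 1).
    return {slot: rng.random() for slot in structure}

def evaluate(bindings):
    # Fitness of the full algorithm depends on its components in context.
    return sum(bindings.values()) / len(bindings)

def bilevel_search(n_outer=20, seed=0):
    rng = random.Random(seed)
    best_score, best_structure = -1.0, None
    for _ in range(n_outer):
        structure = propose_structure(rng)             # outer: what to build
        bindings = realize_functions(structure, rng)   # inner: how it behaves
        score = evaluate(bindings)
        if score > best_score:
            best_score, best_structure = score, structure
    return best_score, best_structure

score, structure = bilevel_search()
```

The point of the separation is visible even in this toy: the outer loop only ever manipulates structures, and implementations are scored in the context of the structure that contains them.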
1. Exterior Layer — Evolving Algorithm Structures
The outer layer uses a Genetic Algorithm (GA) to evolve high-level structures.
Think of it as designing the blueprint:
- Control flow
- Module composition
- Interaction patterns
Key design choice:
Only the structure evolves here—not the detailed implementation.
This avoids the classic problem where LLMs get lost in low-level noise.
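Concretely, the genetic operators act on blueprints rather than code. The module names and operators below are illustrative assumptions; BEAM's actual structure representation is not reproduced here.

```python
import random

# Hypothetical module vocabulary for a routing-style solver.
MODULES = ["greedy_init", "2opt", "ruin_recreate", "simulated_annealing"]

def crossover(a, b, rng):
    # One-point crossover on the module sequence (the blueprint), not on code.
    cut = rng.randint(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]

def mutate(structure, rng, p=0.3):
    # Swap one module slot for a random alternative with probability p.
    s = list(structure)
    if rng.random() < p:
        s[rng.randrange(len(s))] = rng.choice(MODULES)
    return s

rng = random.Random(1)
parent_a = ["greedy_init", "2opt"]
parent_b = ["greedy_init", "ruin_recreate", "simulated_annealing"]
child = mutate(crossover(parent_a, parent_b, rng), rng)
```

Because crossover and mutation only rearrange named modules, the search stays at the architectural level; no low-level code is touched until the inner layer runs.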
2. Interior Layer — Realizing Functions via MCTS
Once a structure is proposed, the system needs to make it work.
This is where Monte Carlo Tree Search (MCTS) comes in.
Instead of generating all functions at once, BEAM:
- Iteratively tests multiple implementations per function
- Evaluates them in context of the full algorithm
- Selects the best combination
This is critical.
Most LLM systems lack causal attribution—they can’t tell which part of the code improved performance. MCTS partially fixes this by isolating function-level contributions.
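A stripped-down version of this selection step can be shown with a UCB-style bandit, one step simpler than full MCTS: each candidate implementation of a function is repeatedly evaluated in the context of the whole algorithm, and the statistics attribute performance to individual candidates. All names and the toy evaluator are assumptions, not the paper's implementation.

```python
import math
import random

def ucb_select(stats, total, c=1.4):
    # stats maps candidate -> (pulls, mean_reward); unpulled candidates win first.
    def score(item):
        cand, (n, mean) = item
        if n == 0:
            return float("inf")
        return mean + c * math.sqrt(math.log(total + 1) / n)
    return max(stats.items(), key=score)[0]

def search_function_impl(candidates, evaluate_in_context, budget=60, seed=0):
    rng = random.Random(seed)
    stats = {c: (0, 0.0) for c in candidates}
    for t in range(budget):
        cand = ucb_select(stats, t)
        reward = evaluate_in_context(cand, rng)
        n, mean = stats[cand]
        stats[cand] = (n + 1, mean + (reward - mean) / (n + 1))  # running mean
    return max(stats, key=lambda c: stats[c][1])

# Toy in-context evaluator: "impl_b" is genuinely better, but rewards are noisy.
def noisy_eval(cand, rng):
    base = {"impl_a": 0.5, "impl_b": 0.7, "impl_c": 0.4}[cand]
    return base + rng.gauss(0, 0.05)

best = search_function_impl(["impl_a", "impl_b", "impl_c"], noisy_eval)
```

The per-candidate statistics are the causal attribution: even with noisy whole-algorithm rewards, the bandit isolates which implementation is pulling its weight.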
3. Adaptive Memory — Reuse Without Regression
Here’s where BEAM quietly outperforms most agentic frameworks.
Instead of regenerating everything each iteration, it builds a function memory:
| Component | Role |
|---|---|
| Fitness score | How well it performed |
| Novelty score | How different it is |
| Usage frequency | How often reused |
| Age penalty | Avoid stale ideas |
Functions are:
- Stored if useful
- Reused if relevant
- Replaced if dominated
This creates something close to algorithmic evolution with institutional memory.
Not just trial-and-error—cumulative intelligence.
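The table's four signals can be combined into a single retention score. The weights, decay rate, and eviction rule below are assumptions for illustration; the paper's exact formulas are not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    name: str
    fitness: float   # how well it performed
    novelty: float   # how different it is from stored entries
    usage: int       # how often it has been reused
    age: int         # iterations since it was added

def retention_score(e, w_fit=0.6, w_nov=0.3, w_use=0.1, decay=0.02):
    # Blend fitness, novelty, and usage; older entries decay toward eviction.
    return (w_fit * e.fitness + w_nov * e.novelty
            + w_use * min(e.usage / 10, 1.0) - decay * e.age)

def insert_with_eviction(memory, entry, capacity=3):
    # Store if useful, keep if relevant, drop the dominated (lowest-scoring) rest.
    memory = memory + [entry]
    memory.sort(key=retention_score, reverse=True)
    return memory[:capacity]

mem = [
    MemoryEntry("greedy_v1", 0.40, 0.10, 8, 30),
    MemoryEntry("swap_v2",   0.65, 0.50, 2, 5),
    MemoryEntry("savings",   0.55, 0.20, 5, 12),
]
mem = insert_with_eviction(mem, MemoryEntry("ruin_v1", 0.70, 0.80, 0, 0))
```

In this toy run, the old, frequently reused but low-fitness `greedy_v1` is the one evicted: usage alone does not protect a function once the age penalty and stronger alternatives dominate it.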
4. Knowledge Augmentation — Forcing the Model to Think Like an Engineer
The system introduces two structured knowledge sources:
| Component | Description |
|---|---|
| HeuBase | Callable heuristic functions |
| KnoBase | Text-based domain knowledge |
This solves a subtle but important issue:
LLMs are good at recombination, not invention.
By constraining the search space with curated knowledge, BEAM shifts the task from:
- “Invent a solver” → unrealistic
to:
- “Compose a better solver” → tractable
Which, incidentally, is how real engineers operate.
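The "compose, don't invent" shift can be made concrete with a toy heuristic base. The heuristics and registry below are illustrative stand-ins for the HeuBase idea, not functions from the paper.

```python
def order_ascending(xs):
    # Toy "heuristic": sort candidates (stand-in for a real ordering heuristic).
    return sorted(xs)

def drop_worst(xs):
    # Toy "heuristic": discard the weakest (last) candidate.
    return xs[:-1] if len(xs) > 1 else xs

# A curated base of callable heuristics, keyed by name.
HEUBASE = {"order": order_ascending, "prune": drop_worst}

def compose_solver(pipeline):
    # "Compose a better solver": chain known heuristics instead of inventing one.
    def solver(xs):
        for step in pipeline:
            xs = HEUBASE[step](xs)
        return xs
    return solver

solve = compose_solver(["order", "prune"])
result = solve([7, 3, 9, 1])
```

The search space collapses from "all possible code" to "all pipelines over a vetted vocabulary," which is what makes the composition task tractable for an LLM.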
Findings — What Actually Improves (and by How Much)
The results are not marginal.
Performance Gains
| Task | Improvement |
|---|---|
| CVRP (routing) | 37.84% reduction in optimality gap |
| MIS (graph problem) | Beats KaMIS (SOTA solver) |
| BBOB (continuous optimization) | Near-SOTA performance |
Notably, BEAM performs well on:
- Hybrid algorithms (multiple techniques combined)
- Full solver design (not just components)
Which is exactly where previous LHH methods fail.
Stability and Consistency
The paper shows that BEAM achieves:
- Lower variance across runs
- More consistent convergence
- Better scalability with problem complexity
Translation: it’s not just smarter—it’s less erratic.
Complexity Trade-off
There is, however, a cost.
| Aspect | BEAM Behavior |
|---|---|
| Code length | Much longer |
| Token usage | Higher |
| Initial latency | Slower start |
The system tends to over-engineer simple problems.
Which is, frankly, a very human flaw.
Implications — Where This Actually Matters for Business
Let’s strip away the academic framing.
1. From Tools to Designers
Most enterprise AI tools today:
- Automate tasks
- Assist decisions
BEAM-like systems move toward:
- Designing processes themselves
This is a different category of capability.
2. Competitive Advantage Shifts Upstream
If algorithms can be auto-designed:
- The edge moves from execution → meta-design
- Firms compete on how systems evolve, not just how they run
Think:
- Logistics optimization
- Trading strategies
- Resource allocation systems
3. Knowledge Becomes a First-Class Asset
The knowledge-augmentation (KA) pipeline implies:
- Performance depends heavily on curated knowledge
- Data is not enough—structured heuristics matter
This favors organizations with:
- Proprietary workflows
- Domain-specific playbooks
4. Token Economics Becomes Strategy
BEAM is computationally expensive.
Which raises an uncomfortable but necessary point:
The future of AI systems will be constrained as much by token budgets as by model capability.
Efficiency is no longer optional—it’s architectural.
Conclusion — The Quiet Shift to Algorithmic Autonomy
BEAM does not “solve AI.” It does something more interesting.
It redefines the unit of intelligence:
- Not prompts
- Not functions
- But entire algorithms
The shift is subtle but consequential.
We are moving from:
- LLMs as assistants
to:
- LLMs as system architects with memory, structure, and evolution loops
The real question is no longer whether AI can write code.
It’s whether it can design systems better than the people who used to.
We’re not quite there yet.
But this paper suggests we’re uncomfortably close.
Cognaptus: Automate the Present, Incubate the Future.