DeltaEvolve: When Evolution Learns Its Own Momentum

Memory is usually where agentic systems go to become expensive.

That is not the glamorous failure mode. It is not the cinematic robot rebellion, nor the slightly more realistic spreadsheet full of hallucinated invoices. It is quieter: an LLM agent keeps improving a program, stores previous attempts, retrieves a few “good” ones, and then spends half its context window rereading code scaffolding that no longer explains anything useful.

The paper behind DeltaEvolve starts from that very unromantic bottleneck.¹ LLM-driven program evolution already works well enough to be interesting. Systems inspired by FunSearch and AlphaEvolve can generate candidate programs, run them through evaluators, keep the better ones, and produce improved variants. The puzzle is no longer whether iteration can help. The puzzle is what the system should remember between iterations.

The default answer has been: remember high-performing programs. Store full code. Include top solutions and diverse solutions in the next prompt. Let the model infer what is useful.

DeltaEvolve says this is the wrong unit of memory. A successful program is not the same thing as an explanation of success. Full code tells the model where the search landed; it does not tell the model which move actually improved the trajectory. That distinction is the paper’s real contribution. The benchmark gains matter, but the mechanism matters more.

The agent is not learning weights; it is learning through context

DeltaEvolve formalizes LLM-driven program evolution as a kind of Expectation–Maximization process. That sounds heavier than the practical idea, so strip it down.

The system is trying to find a program $p$ that scores well under an evaluator $R_q$ for task $q$:

$$ p^\ast = \arg\max_{p \in \mathcal{P}} R_q(p) $$

The evaluator may measure accuracy, runtime, residual error, packing quality, symbolic-regression fit, or another task-specific objective. The search space is not a neat continuous vector space. It is program space: variable-length code, control flow, heuristics, subroutines, constraints, and sometimes programs that themselves perform optimization. A lovely place for gradients to go and die politely.

So the agent alternates between two steps:

Step	What happens	What it means in an LLM agent
E-step	Sample candidate programs	The LLM generates code conditioned on the current task and context
M-step	Update the latent state	The system chooses what historical information enters the next context

Because the paper assumes fixed model weights and black-box access to frontier LLMs, the model itself is not being trained. There is no parameter update hidden behind the curtain. The mutable object is the context.

That makes the M-step the real learning mechanism:

$$ C_{t+1} = \pi(H_{t+1}) $$

Here, $H_{t+1}$ is the accumulated history of generated programs and scores, while $\pi$ is the policy for selecting and formatting that history into the next context.
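To make the alternation concrete, here is a minimal sketch of the loop under heavy assumptions. The "program" is just a list of numbers, the "LLM" is a random perturbation, and `evaluate`, `propose`, and `select_context` are hypothetical stand-ins rather than anything taken from the paper. What it does show is that nothing about the generator changes between iterations; only the context does.

```python
import random
from dataclasses import dataclass


@dataclass
class Attempt:
    program: list   # stand-in for source code: a small parameter vector
    score: float


def evaluate(program):
    """Toy evaluator R_q: higher is better, with a maximum of 0 at all-ones."""
    return -sum((x - 1.0) ** 2 for x in program)


def propose(context, rng):
    """E-step stand-in for the LLM: perturb a program drawn from the context."""
    parent = rng.choice(context).program
    return [x + rng.gauss(0.0, 0.3) for x in parent]


def select_context(history, k=3):
    """M-step: the policy pi maps the history H to the next context C."""
    return sorted(history, key=lambda a: a.score, reverse=True)[:k]


def evolve(steps=200, seed=0):
    rng = random.Random(seed)
    start = [0.0, 0.0, 0.0]
    history = [Attempt(start, evaluate(start))]
    context = select_context(history)
    for _ in range(steps):
        candidate = propose(context, rng)                        # E-step
        history.append(Attempt(candidate, evaluate(candidate)))
        context = select_context(history)                        # M-step
    return context[0], history


best, history = evolve()
print(round(best.score, 4), len(history))
```

Swapping `select_context` for a different policy changes the search behaviour without touching `propose`, which is the sense in which context construction is the learning algorithm.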

This is the part many agent designs quietly under-theorize. They make the model smarter, the evaluator stricter, or the sampling budget larger, while treating memory construction as a filing problem. DeltaEvolve’s argument is sharper: if the context is the only mutable latent variable, then context construction is not bookkeeping. It is the learning algorithm.

Scalar scores are weaker than selected context

A common reader instinct is to imagine that evolutionary coding agents improve because they see numerical rewards. Program A scored 0.81, program B scored 0.93, therefore the model learns the direction of improvement. Nice story. Slightly too clean.

The paper tests this directly with an ablation on five domains using an AlphaEvolve-style framework. It compares three settings:

Setting	What is preserved	What is removed	Likely purpose of the test
AlphaEvolve standard	Top-selected programs and explicit scores	Nothing	Baseline for full-code evolutionary memory
Blind-Elite	Top-selected programs	Numerical scores in the prompt	Ablation: tests whether scalar values themselves drive improvement
Random-Context	Explicit scores	Structured top-selection policy	Ablation: tests whether scores help without good context selection

The result is not subtle. Blind-Elite remains close to AlphaEvolve. Random-Context collapses.

On black-box optimization, AlphaEvolve scores 2.642, Blind-Elite scores 2.578, and Random-Context falls to 1.429. On hexagon packing, the scores are 0.972, 0.970, and 0.786. On efficient convolution, Blind-Elite even slightly exceeds the standard setting, 0.911 versus 0.897, while Random-Context drops to 0.550.

The interpretation is important: scalar feedback is useful for selecting what enters memory, but weak as direct conditioning material. A score says “this worked.” It does not say which part worked, why it worked, or where that idea should be reused.
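The three settings differ only in how the prompt context is assembled. The sketch below assumes a history of dicts with `code` and `score` keys; the function names and the prompt layout are hypothetical and are not the paper's implementation.

```python
import random


def top_k(history, k):
    """Rank by score and keep the k best entries."""
    return sorted(history, key=lambda h: h["score"], reverse=True)[:k]


def standard_context(history, k):
    """Top-selected programs shown together with their explicit scores."""
    return [f"# score: {h['score']:.3f}\n{h['code']}" for h in top_k(history, k)]


def blind_elite_context(history, k):
    """Same top-selected programs, but the numerical scores are withheld."""
    return [h["code"] for h in top_k(history, k)]


def random_context(history, k, rng):
    """Scores are kept, but programs are drawn at random instead of by rank."""
    picked = rng.sample(history, k)
    return [f"# score: {h['score']:.3f}\n{h['code']}" for h in picked]


history = [
    {"code": f"def solve(): return {i}", "score": s}
    for i, s in enumerate([0.21, 0.93, 0.40, 0.81, 0.05])
]
rng = random.Random(0)

print(standard_context(history, 2))
print(blind_elite_context(history, 2))
print(random_context(history, 2, rng))
```

Blind-Elite still benefits from the scores, because `top_k` uses them to rank. It only hides them from the model, which is why the ablation separates "scores as a selection signal" from "scores as conditioning text".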

That is the bridge from AlphaEvolve-style full-code memory to DeltaEvolve’s semantic deltas.

Full-code memory asks the model to rediscover causal structure from artifacts. Delta memory writes down the causal structure before the next generation asks for it. An agent forced to reread full code is like a manager reviewing every version of a financial model to understand why revenue sensitivity improved. Technically possible. Also a nice way to punish everyone involved.

Semantic deltas turn history into direction

DeltaEvolve replaces historical full-code snapshots with semantic deltas:

$$ \delta_i = \mathrm{Diff}(p_i, p_{i-1}) $$

This is not a textual diff in the Git sense. It is a structured description of the logic change between a parent program and its offspring. The paper’s key move is to treat programs as compositional objects. A program is not one indivisible blob. It is a combination of components: initialization logic, constraint handling, search strategy, step-size adaptation, numerical solver structure, boundary treatment, caching, vectorization, and so on.
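The contrast is easier to see side by side. In the sketch below, the two program snippets, the `latin_hypercube` helper they mention, and the wording of the semantic record are all invented for illustration. `difflib` produces the Git-style view, and the dict is a hand-written stand-in for what a semantic delta might record.

```python
import difflib

parent = """def init(n, rng):
    return [rng.random() for _ in range(n)]
"""

child = """def init(n, rng):
    return latin_hypercube(n, rng)
"""

# Textual diff: which characters changed, with no account of why.
text_diff = list(
    difflib.unified_diff(
        parent.splitlines(),
        child.splitlines(),
        fromfile="parent.py",
        tofile="child.py",
        lineterm="",
    )
)

# Semantic delta (hypothetical wording): which component changed, and the intent.
semantic_delta = {
    "component": "Initialization",
    "old_logic": "independent uniform random samples",
    "new_logic": "Latin Hypercube sampling",
    "hypothesis": "stratified starting points cover the search space more evenly",
}

print("\n".join(text_diff))
print(semantic_delta)
```

The textual diff grows with the size of the edit. The semantic record stays about the same length whether the change touched two lines or two hundred.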

If programs are compositional, then improvement often comes from changes to components rather than from the whole program. A better candidate may not be useful because the entire source file deserves worship. It may be useful because it replaced random initialization with Latin Hypercube sampling, added adaptive local search, changed a boundary handler, introduced a Metropolis acceptance rule, or shifted convolution from spatial computation to FFT-based computation.

DeltaEvolve stores those changes explicitly.

Its delta context has the form:

$$ C_{\Delta} = \{ p_{\text{parent}} \oplus (\delta_1, \Delta R_1), \dots, (\delta_n, \Delta R_n) \} $$

Each $\Delta R_i$ is a qualitative label (improved, degraded, or similar) rather than a precise numerical gain. This choice follows from the earlier ablation: exact reward magnitudes are not the main carrier of useful guidance. The mechanism of change is.

The paper calls this “momentum” by analogy to optimization. In gradient-based optimization, the direction of prior movement helps guide future updates. In DeltaEvolve, the accumulated semantic differences between programs form a discrete, natural-language analogue of a momentum vector. That record tells the model not only which solutions were good, but which kinds of modifications have been pushing the search in a productive direction.

This is the central mechanism. Not “more memory.” Better-shaped memory.
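One way to picture the delta context is as the parent's code followed by a short list of changes, each tagged with a coarse outcome. The sketch below is an assumption-laden illustration. The `Delta` class, the tolerance used to decide "similar", the prompt layout, and the example summaries are all hypothetical, not the paper's format.

```python
from dataclasses import dataclass


@dataclass
class Delta:
    summary: str         # what changed, in FROM/TO style
    parent_score: float
    child_score: float


def qualitative_shift(parent_score, child_score, tol=1e-3):
    """Collapse a numeric change into improved / degraded / similar.

    The tolerance is an arbitrary illustrative choice.
    """
    diff = child_score - parent_score
    if diff > tol:
        return "improved"
    if diff < -tol:
        return "degraded"
    return "similar"


def render_delta_context(parent_code, deltas):
    """Parent code first, then each delta paired with its qualitative outcome."""
    lines = ["# Current parent program", parent_code, "", "# Past changes and their effect"]
    for d in deltas:
        label = qualitative_shift(d.parent_score, d.child_score)
        lines.append(f"- {d.summary} -> {label}")
    return "\n".join(lines)


deltas = [
    Delta("FROM random init TO Latin Hypercube init", 0.80, 0.93),
    Delta("FROM adaptive step TO fixed step", 0.93, 0.85),
    Delta("FROM list buffers TO array buffers", 0.93, 0.93),
]
print(render_delta_context("def solve(): ...", deltas))
```

The model sees one full program plus a trail of labelled moves, instead of several full programs with numbers attached.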

The three-level database is the operational trick, not a decorative architecture diagram

DeltaEvolve does not merely say “summarize history.” That would be cheaper, and also much less useful. The paper uses a multi-level database, where each evolutionary node contains three representations:

Level	Stored content	Function
Level 1	Delta Summary	A compact FROM/TO description of the strategic shift, usually around 20–40 tokens
Level 2	Delta Plan Details	Structured modifications: old logic, new logic, and hypothesis for each changed component
Level 3	Full Code	The executable program, retained for the current parent and evaluation

This hierarchy matters because different historical nodes need different levels of detail.

Old elite nodes may still carry useful strategic signals, but including their full code wastes context. Recent or highly relevant nodes may deserve more detailed delta plans because the model needs to see concrete logic changes and hypotheses. The current parent still requires full code, because the model must modify executable source rather than edit a motivational poster.

DeltaEvolve’s progressive disclosure sampler therefore performs two jobs:

Node selection: choose a parent, top-performing elite nodes, and diverse inspiration nodes.
Multi-level rendering: decide whether each selected node enters the prompt as Level 1 summary, Level 2 plan, or Level 3 code.

This is the paper’s practical engineering insight. The memory system is not simply compressed. It is rendered at different resolutions depending on how the next mutation will use it.
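A rough sketch of multi-level rendering follows. It assumes node selection has already happened and only shows the second job, choosing a resolution per node. The `Node` class, the role-to-level rule, and the section headers are hypothetical simplifications, not the paper's sampler.

```python
from dataclasses import dataclass


@dataclass
class Node:
    summary: str   # Level 1: short FROM/TO description
    plan: str      # Level 2: old logic, new logic, hypothesis per component
    code: str      # Level 3: full executable program


LEVEL_FIELD = {1: "summary", 2: "plan", 3: "code"}


def render(node, level):
    """Return the representation of a node at the requested level."""
    return getattr(node, LEVEL_FIELD[level])


def build_prompt(parent, recent, older):
    """Assumed rule: parent as code, recent nodes as plans, older elites as summaries."""
    sections = [
        ("PARENT (Level 3)", [render(parent, 3)]),
        ("RECENT NODES (Level 2)", [render(n, 2) for n in recent]),
        ("OLDER ELITES (Level 1)", [render(n, 1) for n in older]),
    ]
    lines = []
    for title, bodies in sections:
        lines.append(f"## {title}")
        lines.extend(bodies)
    return "\n".join(lines)


def make_node(i):
    """Build a synthetic node whose code is much longer than its descriptions."""
    return Node(
        summary=f"FROM strategy {i} TO strategy {i + 1}",
        plan=f"Step-size: OLD_LOGIC fixed step {i}; NEW_LOGIC adaptive step {i + 1}",
        code="\n".join(f"x{j} = step_{i}(x{j})" for j in range(40)),
    )


nodes = [make_node(i) for i in range(6)]
parent, recent, older = nodes[5], nodes[3:5], nodes[:3]
prompt = build_prompt(parent, recent, older)
full_code_prompt = "\n".join(n.code for n in nodes)
print(len(prompt), "characters, versus", len(full_code_prompt), "for a full-code history")
```

The saving comes from the older and recent nodes, whose code never enters the prompt. The parent is still present in full, so the model has something executable to edit.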

That is a useful distinction for business agents. Compression asks, “How can we fit more into context?” Progressive disclosure asks, “What level of detail does this next decision actually need?” The second question is more operationally intelligent. It also tends to be cheaper, which is convenient because finance departments continue to exist.

The main results support both quality and cost claims

The paper evaluates DeltaEvolve on five program-discovery domains:

black-box optimization over benchmark functions;
packing 11 unit regular hexagons into a minimal outer hexagon;
symbolic regression for nonlinear oscillator dynamics;
PDE solver discovery for sparse systems;
efficient 2D convolution under correctness and runtime constraints.

The baselines include Parallel Sampling, Greedy Refine, and AlphaEvolve-style full-code evolutionary search. The main comparison of interest is AlphaEvolve versus DeltaEvolve, because that isolates the memory representation issue most directly.

The experiments use two model-family pairings: GPT-5-mini with o3-mini, and Gemini-2.5-flash-lite with Gemini-2.5-flash. Each method is run with three random seeds, and the paper reports the maximum best score across seeds alongside the average cumulative token consumption.

Here is the condensed comparison against the full-code AlphaEvolve baseline:

Task	GPT-family score: AlphaEvolve → DeltaEvolve	Gemini-family score: AlphaEvolve → DeltaEvolve	Token reduction vs AlphaEvolve (GPT-family / Gemini-family)
Black-box optimization	2.6415 → 2.7297	2.5221 → 3.9372	24.9% / 35.2%
Hexagon packing	0.9721 → 0.9821	0.7859 → 0.8804	8.5% / 33.0%
Symbolic regression	3.2657 → 3.4174	3.2174 → 3.2198	51.2% / 46.1%
PDE solver	0.8850 → 0.8915	0.9901 → 0.9931	20.9% / 57.4%
Efficient convolution	0.8974 → 0.9067	0.8219 → 0.9032	49.0% / 41.6%

The average token reduction is approximately 36.79%. That number is not a rounding error dressed up as architecture. It comes from replacing full-code histories with structured deltas and revealing full code only where the next edit requires it.

The score improvements require more careful interpretation. In some tasks, the gains are large: black-box optimization under the Gemini-family pairing jumps from 2.5221 to 3.9372. In other tasks, the improvement is modest: symbolic regression under the Gemini-family pairing moves from 3.2174 to 3.2198. The paper correctly notes that when tasks approach strong baselines or known optima, small score gains can still be meaningful.
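The arithmetic behind those two paragraphs can be checked from the rounded table entries. The snippet below averages the ten token-reduction cells and computes relative score gains for the two Gemini-family rows just mentioned. Because the table values are already rounded, the mean comes out at about 36.78%, which agrees with the reported 36.79% to within rounding.

```python
# Token reduction vs AlphaEvolve, in percent, copied from the table (two cells per task).
token_reduction = {
    "Black-box optimization": (24.9, 35.2),
    "Hexagon packing": (8.5, 33.0),
    "Symbolic regression": (51.2, 46.1),
    "PDE solver": (20.9, 57.4),
    "Efficient convolution": (49.0, 41.6),
}

cells = [value for pair in token_reduction.values() for value in pair]
mean_reduction = sum(cells) / len(cells)


def relative_gain(before, after):
    """Percentage improvement of `after` over `before`."""
    return (after - before) / before * 100.0


# Gemini-family best scores: AlphaEvolve -> DeltaEvolve.
gain_black_box = relative_gain(2.5221, 3.9372)
gain_symbolic = relative_gain(3.2174, 3.2198)

print(f"mean token reduction: {mean_reduction:.2f}%")
print(f"black-box optimization gain: {gain_black_box:.1f}%")
print(f"symbolic regression gain: {gain_symbolic:.2f}%")
```

The spread is the point: one row improves by roughly 56% in relative terms, the other by less than a tenth of a percent, so "DeltaEvolve improves the score" means very different things from row to row.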

Still, the strongest business-relevant result is not “DeltaEvolve wins every row.” It is that DeltaEvolve usually improves or preserves solution quality while materially reducing the token budget. In production agents, those two metrics rarely travel together. Better outputs usually cost more. DeltaEvolve’s claim is that better memory representation can move the frontier.

The appendix case study shows what “momentum” looks like in practice

The appendix is not a second thesis. It is where the mechanism becomes visible.

In the black-box optimization case study, the paper annotates the evolutionary trajectory with Level 1 delta summaries. The early generations explore basic heuristics such as local Gaussian perturbations. A key shift appears around iteration 14: the agent moves toward Latin Hypercube initialization with adaptive batched local search, producing a steep performance gain. Later changes add stagnation probing and Metropolis acceptance to improve budget use and escape local optima.

This is exactly the kind of history that full-code memory hides badly. If a later candidate needs inspiration, the useful lesson is not the entire earlier source file. The useful lesson is something like:

random sampling was replaced with Latin Hypercube initialization;
local search became adaptive and batched;
stagnation triggered probing rather than passive continuation;
acceptance logic became less greedily trapped.
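For readers who have not met the two named techniques, here is a small standard-library sketch of Latin Hypercube initialization and a Metropolis acceptance rule. It is not the program the agent evolved. The test function, step size, temperature, and all function names are arbitrary choices made for illustration.

```python
import math
import random


def latin_hypercube(n_points, n_dims, rng):
    """One sample per stratum in every dimension, on the unit cube."""
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_points))
        rng.shuffle(strata)
        columns.append([(s + rng.random()) / n_points for s in strata])
    return [[columns[d][i] for d in range(n_dims)] for i in range(n_points)]


def metropolis_accept(current, candidate, temperature, rng):
    """Minimization: always accept improvements, sometimes accept worse moves."""
    if candidate <= current:
        return True
    return rng.random() < math.exp(-(candidate - current) / temperature)


def sphere(x):
    """Toy objective with its minimum of 0 at the centre of the unit cube."""
    return sum((xi - 0.5) ** 2 for xi in x)


def search(n_starts=8, n_dims=2, steps=200, seed=0):
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for start in latin_hypercube(n_starts, n_dims, rng):
        x, fx = start, sphere(start)
        if fx < best_f:
            best_x, best_f = x, fx
        for _ in range(steps):
            # Gaussian perturbation, clipped back into the unit cube.
            cand = [min(1.0, max(0.0, xi + rng.gauss(0.0, 0.05))) for xi in x]
            fc = sphere(cand)
            if metropolis_accept(fx, fc, temperature=0.01, rng=rng):
                x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = x, fx
    return best_x, best_f


best_x, best_f = search()
print(best_x, best_f)
```

Each of the four lessons in the list above fits in a single sentence. Recovering any of them from a listing like this one, with nothing marking which lines matter, is the work that delta memory spares the next generation.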

The Level 2 delta plan makes this even more explicit. The paper shows a structured log with components such as Initialization, Polishing, and Step-size. Each component contrasts OLD_LOGIC and NEW_LOGIC and includes a hypothesis explaining why the change should help.

That is the right shape of memory for iterative agents. It is not only machine-readable. It is human-auditable. A developer can inspect why the agent changed direction. A future agent step can reuse the same logic without parsing hundreds of lines of implementation scaffolding. Interpretability and efficiency are not enemies here. They are, for once, sharing an elevator without drama.

The evidence map: what each result does and does not prove

DeltaEvolve includes several kinds of evidence. Mixing them together would make the paper sound stronger but less precise. Better to keep the roles separate.

Evidence	Likely purpose	What it supports	What it does not prove
Feedback ablation: Blind-Elite vs Random-Context	Ablation	Context selection matters more than exposing scalar scores directly	It does not prove semantic deltas are always the best possible memory format
Main five-domain benchmark table	Main evidence and comparison with prior full-code memory	DeltaEvolve improves or preserves best scores while reducing token consumption across tested domains	It does not guarantee gains on tasks without reliable automatic evaluators
Progressive disclosure design	Implementation detail	Multi-resolution memory can reduce context waste while preserving executable editability	It does not isolate every design component’s individual contribution
Appendix case study	Exploratory illustration and mechanism demonstration	Delta summaries can reveal coherent trajectories of algorithmic improvement	It is one trajectory, not standalone statistical proof
Prompt templates and configuration	Reproducibility and implementation detail	The system relies on strict delta logging, parsing, and controlled prompt structure	It does not remove dependency on LLM compliance or evaluator quality

This map matters because the paper’s business implication depends less on the raw benchmark table and more on the combined mechanism: selected context beats raw scores; semantic deltas expose reusable changes; progressive disclosure reduces cost; automatic evaluators close the loop.

Remove any one of those and the story weakens.

The business value is structured operational memory, not merely cheaper prompts

For business readers, the easy takeaway is “use summaries instead of full context to save tokens.” That is too shallow.

The better lesson is that iterative agents need memory objects designed around improvement mechanisms. A business process agent that revises SQL queries, forecasting scripts, reconciliation rules, pricing logic, customer-support workflows, or marketing experiments should not store only the final artifact and its score. It should store the change that moved the metric.

A DeltaEvolve-inspired business log would look less like this:

Version 17 achieved 94.2% validation accuracy. Full script attached.

And more like this:

Field	Example content
Old logic	Forecast used global seasonality and a fixed promotion coefficient
New logic	Added category-level seasonal factors and separated discount depth from promotion flag
Hypothesis	Category seasonality explains demand variance that global seasonality averages away
Outcome	Improved validation error; watch for overfitting in sparse categories
Reuse condition	Apply to categories with enough historical observations and stable SKU mapping

That format is useful to both machines and managers. The next agent run can retrieve the logic. A human reviewer can audit whether the change is plausible. A team can build institutional knowledge from experiments instead of letting every iteration dissolve into a pile of artifacts.
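As a sketch of how such a log entry might be stored, the example below puts the table's fields into a small dataclass and round-trips it through JSON. The `DeltaRecord` name, the field names, and the `summary` helper are assumptions for illustration; the field contents are copied from the table.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DeltaRecord:
    old_logic: str
    new_logic: str
    hypothesis: str
    outcome: str
    reuse_condition: str

    def summary(self):
        """One-line FROM/TO view for quick retrieval."""
        return f"FROM {self.old_logic} TO {self.new_logic}"


record = DeltaRecord(
    old_logic="Forecast used global seasonality and a fixed promotion coefficient",
    new_logic="Added category-level seasonal factors and separated discount depth from promotion flag",
    hypothesis="Category seasonality explains demand variance that global seasonality averages away",
    outcome="Improved validation error; watch for overfitting in sparse categories",
    reuse_condition="Apply to categories with enough historical observations and stable SKU mapping",
)

# One JSON object per line makes for a simple append-only log.
line = json.dumps(asdict(record))
restored = DeltaRecord(**json.loads(line))

print(restored.summary())
```

Keeping the record structured rather than free-text is what lets a later run filter on a field such as the reuse condition instead of rereading the whole entry.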

This is where DeltaEvolve becomes relevant beyond scientific discovery. Many business AI systems already face the same structural problem: repeated attempts, partial feedback, growing histories, and a context budget that refuses to become infinite merely because a roadmap says “agentic.”

The practical implication is simple:

Technical contribution	Operational consequence	ROI relevance
EM framing of program evolution	Treat context construction as the learning loop	Invest in memory policy, not just larger models
Semantic deltas	Store what changed and why, not just final artifacts	Reduce repeated failures and make improvement reusable
Multi-level database	Separate summaries, detailed plans, and executable artifacts	Lower token cost without deleting useful history
Progressive disclosure	Retrieve detail only when needed	Control latency and cost in long-running workflows
Qualitative performance shifts	Avoid overconditioning on noisy exact scores	Make memory robust when metrics are imperfect but directional

This is not a call to sprinkle “delta memory” into every chatbot and declare victory. Please do not do that. The paper’s setting is narrower and cleaner than most business environments. The tasks have executable programs and automatic evaluators. The system can run candidates, score them, and keep iterating.

That is a very different environment from open-ended strategy discussion, policy judgment, or high-stakes decisions where evaluation is delayed, contested, political, or partly qualitative.

Where the result applies—and where it probably does not

DeltaEvolve is strongest when five conditions hold.

First, the output is executable or testable. Code, solvers, optimization routines, workflow rules, data transformations, and analytical scripts are natural candidates.

Second, evaluation is automatic enough to support iteration. The paper’s tasks can score candidates programmatically. Business analogues include unit tests, benchmark datasets, backtests, validation error, runtime, reconciliation accuracy, extraction F1, or SLA compliance.

Third, improvements are decomposable. The method assumes programs contain reusable components. If the output is a single indivisible judgment, semantic deltas have less structure to exploit.

Fourth, history is long enough for memory design to matter. DeltaEvolve is aimed at iterative discovery, not one-shot generation.

Fifth, logs are disciplined. The delta summary and delta plan are not casual notes. The paper uses strict output formats, delimiters, and parsing rules. A vague “improved the algorithm” summary is not a delta. It is a souvenir.
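What "disciplined" could mean in code is a parser that refuses anything without the required structure. In the sketch below, the `<delta>` delimiters and the exact field list are hypothetical. Only the `OLD_LOGIC` and `NEW_LOGIC` names echo the paper's structured log, and the example entry is invented.

```python
import re

REQUIRED = ("COMPONENT", "OLD_LOGIC", "NEW_LOGIC", "HYPOTHESIS")
BLOCK = re.compile(r"<delta>(.*?)</delta>", re.DOTALL)


def parse_delta_log(text):
    """Return one dict per <delta> block; raise if structure is missing."""
    blocks = BLOCK.findall(text)
    if not blocks:
        raise ValueError("no <delta> block found")
    entries = []
    for block in blocks:
        fields = {}
        for line in block.strip().splitlines():
            key, sep, value = line.partition(":")
            if sep and key.strip() in REQUIRED:
                fields[key.strip()] = value.strip()
        missing = [k for k in REQUIRED if not fields.get(k)]
        if missing:
            raise ValueError(f"delta is missing fields: {missing}")
        entries.append(fields)
    return entries


good_log = """
<delta>
COMPONENT: Initialization
OLD_LOGIC: uniform random sampling
NEW_LOGIC: Latin Hypercube sampling
HYPOTHESIS: stratified starts cover the space more evenly
</delta>
"""

print(parse_delta_log(good_log))

try:
    parse_delta_log("improved the algorithm")
except ValueError as err:
    print("rejected:", err)
```

Rejecting the vague entry at write time costs one retry. Accepting it means every later retrieval carries a line that explains nothing.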

The uncertain part is generalization to messy enterprise decisions. Business agents often operate with delayed metrics, human preferences, ambiguous constraints, changing objectives, and feedback contaminated by politics. In those settings, semantic deltas may still help, but the evaluator problem becomes central. If the score is bad, delayed, or gameable, memory will faithfully preserve the wrong lessons. Very efficient nonsense remains nonsense. It just invoices less.

What to borrow now

The most useful near-term borrowing from DeltaEvolve is not the entire research system. It is the memory schema.

For any iterative AI workflow, ask four questions after each meaningful revision:

What was the old logic?
What changed?
Why was the change expected to help?
What happened under evaluation?

Then store the answer at multiple levels:

Memory layer	Business analogue
Delta summary	One-line FROM/TO change for quick retrieval
Delta plan	Component-level old/new/hypothesis/outcome record
Full artifact	Script, prompt, workflow, dashboard, or policy version

This turns memory from an archive into an improvement ledger. It also gives teams something rare: a way to inspect the reasoning trajectory of an automated workflow without replaying every artifact from the beginning.

For Cognaptus-style automation, that distinction matters. A business process does not become intelligent because an LLM touches it. It becomes more adaptive when each run leaves behind reusable operational knowledge. The DeltaEvolve paper shows one technically serious way to do that in program discovery.

The broader lesson is almost embarrassingly practical: if an agent is going to evolve, it should remember movement, not clutter.

Conclusion: the useful past is the part that changed the future

DeltaEvolve’s best idea is not that semantic deltas save tokens, although they do. It is that evolutionary agents need a memory representation aligned with how improvement actually happens.

Full programs are artifacts. Scores are outcomes. Deltas are movement.

That movement is what gives the system something like momentum: a reusable direction of search, expressed not as hidden weights but as structured, inspectable context. The paper’s experiments show that this can improve best-achieved solutions while cutting token use by roughly 36.79% on average against full-code evolutionary baselines. The appendix then shows what the mechanism looks like when the agent shifts from random search to structured initialization, adaptive local search, stagnation probing, and acceptance strategies.

For businesses, the lesson is not to copy the scientific benchmark setup blindly. The lesson is to stop treating agent memory as a pile of prior outputs. In any workflow where AI systems iterate, test, revise, and improve, the most valuable memory is often the smallest honest record of what changed and why.

Evolution does not need a museum. It needs a trail.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jiachen Jiang, Tianyu Ding, and Zhihui Zhu, “DeltaEvolve: Accelerating Scientific Discovery through Momentum-Driven Evolution,” arXiv:2602.02919, 2026. https://arxiv.org/abs/2602.02919
