Orders are where trading systems stop sounding intelligent and start spending money.

A model can narrate the market beautifully. It can explain momentum, liquidity, volatility regimes, inventory pressure, and the great moral tragedy of being early. None of that matters if the final system places the wrong limit order, sizes too aggressively, fills only in a fantasy simulator, or wins a backtest because it tried enough variants to accidentally find one that looked divine.

That is why the most interesting part of “MadEvolve: Evolutionary Optimization of Trading Systems with Large Language Models” is not that an LLM-driven system improves a Bitcoin trading backtest.[1] That is the headline, and yes, the numbers are large enough to attract the usual crowd of “AI trading bot” prophets with affiliate links and suspicious screenshots.

The more useful lesson is narrower and more operational: LLM agents become valuable when they are placed inside a controlled optimization factory. The paper’s system does not ask a model to “trade BTC.” It fixes the simulator, fixes the data splits, fixes the accounting logic, exposes only selected blocks of trading code to mutation, evaluates each candidate mechanically, and keeps out-of-sample testing separate from the search objective.

That distinction matters. A trading desk does not need another eloquent intern with a Bloomberg terminal and a temperature setting. It needs a way to generate candidate algorithms, reject most of them, preserve the few that survive realistic scoring, and know when the process has started mining the backtest instead of the market.

The paper is a proof-of-concept, not a live-trading recipe. But as a design pattern for agentic business optimization, it is unusually instructive.

The paper is not asking whether an LLM can predict Bitcoin

The lazy version of this article would ask: “Can AI beat the Bitcoin market?”

That is not the right question. The paper studies whether an LLM-driven evolutionary code optimizer can improve different components of a trading system when those components are evaluated under a fixed backtest harness.

The authors use BTCUSD one-minute OHLCV bars, split chronologically into:

| Split | Period | Role |
| --- | --- | --- |
| Train | 2022–2023 | Fitting the alpha model |
| Validation | 2024 | Fitness target for evolution |
| Test | 2025-01-01 to 2025-10-10 | Held-out out-of-sample evaluation |
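
The chronological split can be written down as a small date lookup. This is a minimal sketch, not the paper's code: the function name `assign_split` is hypothetical, and treating every boundary as inclusive is an assumption.

```python
from datetime import date
from typing import Optional

# Boundaries follow the split table; inclusive endpoints are an assumption.
SPLITS = [
    ("train", date(2022, 1, 1), date(2023, 12, 31)),
    ("validation", date(2024, 1, 1), date(2024, 12, 31)),
    ("test", date(2025, 1, 1), date(2025, 10, 10)),
]

def assign_split(day: date) -> Optional[str]:
    """Return the split a calendar day falls into, or None if it is outside all of them."""
    for name, start, end in SPLITS:
        if start <= day <= end:
            return name
    return None
```

Because the ranges never overlap, no bar can serve as both fitness signal and held-out evidence.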

The trading setup is a passive limit-order strategy. Each minute, the strategy proposes a signed trade size and a limit price. The simulator checks whether the order would have filled inside the realized candle range, updates inventory, cancels the prior order, and submits the next one. The main optimization metric is impact-adjusted PnL, which subtracts trading fees and a market-impact penalty.
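
A candle-based fill rule of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions, not the paper's simulator: fills are all-or-nothing at the limit price, and the fee rate, impact coefficient, and impact exponent are placeholder values chosen only to make the accounting visible.

```python
from dataclasses import dataclass

@dataclass
class Candle:
    high: float
    low: float

def limit_order_fills(size: float, limit_price: float, candle: Candle) -> bool:
    """A buy (size > 0) fills if the candle traded at or below the limit;
    a sell (size < 0) fills if it traded at or above the limit."""
    if size > 0:
        return candle.low <= limit_price
    if size < 0:
        return candle.high >= limit_price
    return False

def simulate_minute(inventory, cash, size, limit_price, candle,
                    fee_rate=0.0002, impact_coef=0.0, impact_exponent=1.5):
    """Apply one order to the book state; an unfilled order is simply cancelled."""
    if not limit_order_fills(size, limit_price, candle):
        return inventory, cash
    notional = abs(size) * limit_price
    cash -= size * limit_price                          # pay for buys, receive for sells
    cash -= fee_rate * notional                         # trading fee (placeholder rate)
    cash -= impact_coef * notional ** impact_exponent   # stylized impact penalty (assumed form)
    return inventory + size, cash
```

The rule ignores queue position, partial fills, and latency, which is exactly why results from such a simulator need separate validation on a real venue.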

This is already a useful boundary. The paper is not claiming that the system would work unchanged on Binance, OKX, Coinbase, Hyperliquid, or the private exchange in someone’s Telegram imagination. The data are aggregated across venues. The fill model is candle-based. Market impact is modeled, not observed from actual execution. The authors explicitly state that the quantitative results should not be expected to transfer “out of the box” to a real exchange.

Good. That caveat is not decorative. It defines the problem.

Within that controlled world, the question becomes: if the evaluation environment is fixed and the mutable code region is constrained, can LLM-driven evolution discover better trading-system logic?

The real mechanism is constrained code evolution

MadEvolve is an LLM-driven evolutionary optimization framework inspired by systems such as FunSearch and AlphaEvolve. In this paper, it is adapted to trading.

The loop is simple in outline:

  1. Select an existing program from a structured population.
  2. Retrieve useful “inspiration” programs from the archive.
  3. Ask an LLM to mutate the editable code region.
  4. Evaluate the resulting candidate using the fixed backtest or forecasting metric.
  5. Insert the scored candidate back into the population.
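
The five steps above can be sketched as a generic loop. Everything here is a simplification for illustration: `mutate` stands in for the LLM call, `evaluate` stands in for the fixed backtest, and the two-way tournament and top-three inspiration rule are assumed choices, not the paper's selection scheme.

```python
import random

def evolve(seed_program, mutate, evaluate, generations=50, rng=None):
    """Select -> retrieve inspirations -> mutate -> evaluate -> insert."""
    rng = rng or random.Random(0)
    population = [(evaluate(seed_program), seed_program)]
    for _ in range(generations):
        # 1. Select a parent via a two-way tournament (assumed rule).
        contenders = rng.sample(population, k=min(2, len(population)))
        parent = max(contenders, key=lambda entry: entry[0])[1]
        # 2. Retrieve the current top programs as inspirations.
        ranked = sorted(population, key=lambda entry: entry[0], reverse=True)
        inspirations = [program for _, program in ranked[:3]]
        # 3. Mutate (an LLM call in the real system).
        child = mutate(parent, inspirations)
        # 4 and 5. Score mechanically, then insert into the population.
        population.append((evaluate(child), child))
    return max(population, key=lambda entry: entry[0])

# Toy usage: a "program" is one number and the fixed "backtest" rewards values near 3.
_noise = random.Random(1)
toy_evaluate = lambda x: -(x - 3.0) ** 2
toy_mutate = lambda parent, inspirations: parent + _noise.gauss(0.0, 0.5)
best_score, best_program = evolve(0.0, toy_mutate, toy_evaluate, generations=200)
```

Note that `evaluate` is passed in and never touched by `mutate`. That separation is the point of the design, not an implementation detail.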

The important word is editable. The framework uses explicit EVOLVE-BLOCK markers. Code outside those blocks—the simulator, data loading, PnL accounting, and evaluation logic—stays fixed. This design prevents the model from improving the score by quietly rewriting the exam.
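
One way to enforce that boundary is to splice mutated code into a template rather than accept a whole rewritten file. The sketch below assumes the marker spelling `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` and a single editable block per file; the helper `splice` is hypothetical and not taken from MadEvolve.

```python
START, END = "# EVOLVE-BLOCK-START", "# EVOLVE-BLOCK-END"  # assumed marker spelling

def splice(source: str, new_body: str) -> str:
    """Replace only the text between the markers; everything else is kept verbatim."""
    if source.count(START) != 1 or source.count(END) != 1:
        raise ValueError("expected exactly one editable block")
    head, rest = source.split(START, 1)
    _, tail = rest.split(END, 1)  # raises ValueError if END comes before START
    body = new_body.strip("\n")
    return f"{head}{START}\n{body}\n{END}{tail}"

TEMPLATE = """def backtest(strategy):  # fixed harness
    return strategy()

# EVOLVE-BLOCK-START
def strategy():
    return 1.0
# EVOLVE-BLOCK-END
"""

mutated = splice(TEMPLATE, "def strategy():\n    return 2.0")
```

Whatever the model proposes, `backtest` in this toy template comes out byte-for-byte unchanged.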

That is the first business lesson. Agentic optimization needs a hard separation between the thing being optimized and the thing doing the judging. Without that boundary, the agent is not optimizing a business process. It is negotiating with the scoreboard. We already have consultants for that.

MadEvolve also uses population management rather than a single linear prompt chain. It keeps structurally diverse candidates through a MAP-Elites-style grid, uses island populations with migration, and preserves elite solutions. This matters because trading-system design is not a smooth hill-climb. A mutation can improve sizing while damaging fill probability, improve hit rate while destroying risk-adjusted return, or improve PnL by increasing volume until impact costs start collecting rent.
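
The core of a MAP-Elites-style archive is small: bucket each candidate by a behaviour descriptor and keep only the best scorer per bucket. In this sketch the two descriptor axes (turnover and hit rate, each scaled to the range 0 to 1) are assumptions made for illustration; the grid dimensions used in the paper are not described here.

```python
def descriptor(turnover: float, hit_rate: float, bins: int = 4) -> tuple:
    """Map two behaviour measures in [0, 1] to a grid cell (assumed axes)."""
    def bucket(x: float) -> int:
        return min(bins - 1, max(0, int(x * bins)))
    return (bucket(turnover), bucket(hit_rate))

def insert_elite(archive: dict, cell: tuple, score: float, program) -> bool:
    """Keep the candidate only if its cell is empty or it beats the incumbent."""
    incumbent = archive.get(cell)
    if incumbent is None or score > incumbent[0]:
        archive[cell] = (score, program)
        return True
    return False
```

A low-scoring candidate in an empty cell survives, while a mediocre candidate in a crowded cell does not. That is how the archive keeps structurally different strategies alive long enough to be recombined.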

The system also uses an ensemble of models rather than one model. In the reported runs, mutation requests are routed across several frontier and efficient models. The model comparison is not the central result, and the specific rankings will age quickly, but the pattern is useful: no single model dominates every best-solution lineage. Diversity in mutation style matters more than worshipping the model leaderboard of the week. The leaderboard, naturally, will have recovered by Monday.

The experiment ladder separates easier feedback from noisier feedback

The paper’s five runs are best understood as a ladder of increasing difficulty and noisiness.

| Run | What evolves | Fitness signal | Likely purpose |
| --- | --- | --- | --- |
| Run 1 | Target-position logic | Impact-adjusted PnL | Main evidence for sizing and position-construction improvement |
| Run 2 | Order-placement logic | Impact-adjusted PnL | Main evidence for execution optimization |
| Run 3 | Target and order jointly | Impact-adjusted PnL | Main evidence on joint strategy search |
| Run 4 | Forecast feature set only | Forecast-quality composite | Main evidence for feature engineering, not trading PnL |
| Run 5 | Feature set and full strategy jointly | Impact-adjusted PnL | Main evidence for end-to-end co-design |

This ladder is the paper’s strongest organizing device. It prevents the results from collapsing into one vague claim that “AI improved trading.” The system improves different things under different reward structures.

Execution optimization is comparatively crisp. If the forecast is held fixed and only order placement changes, PnL feedback is direct. A better execution policy can improve fill quality, reduce adverse selection, respond to inventory, and align order depth with signal strength. The reward signal is still noisy because markets are noisy, but it is less indirect than asking the system to invent predictive features from minute bars.

Forecasting is harder. A feature set can improve information coefficient without improving trading PnL because the executor may be calibrated to the old signal scale. Better alpha does not automatically become better money. It first has to pass through sizing, thresholds, risk controls, and execution.

Joint optimization is the most ambitious version: let features and execution co-adapt. That can capture interactions missed by separate tuning, but it also expands the search space and increases overfitting pressure.

This is why the paper deserves a mechanism-first reading. The result is not one number. It is the relationship between mutable scope, reward quality, search difficulty, and generalization.

The headline results are large, but the pattern matters more than the multiplier

The four PnL-optimizing runs all improve out-of-sample performance relative to the shared baseline. The strongest absolute test PnL comes from order-placement evolution. The strongest test Sharpe comes from joint feature-plus-strategy evolution.

| Run | Validation Sharpe | Validation impact-adjusted PnL | Test Sharpe | Test impact-adjusted PnL | Interpretation |
| --- | --- | --- | --- | --- | --- |
| Baseline | 4.81 | $83K | 3.82 | $47K | Shared baseline |
| Run 1: target only | 4.83 | $533K | 4.45 | $271K | Larger PnL, modest Sharpe gain |
| Run 2: order only | 6.49 | $2.238M | 5.12 | $1.205M | Largest absolute OOS PnL |
| Run 3: joint strategy | 6.51 | $973K | 5.11 | $473K | Strong, but not better than Run 2 in PnL |
| Run 5: joint features + strategy | 8.85 | $1.855M | 5.65 | $724K | Highest OOS Sharpe |

The obvious temptation is to stare at the multipliers. Run 2 takes test impact-adjusted PnL from $47K to $1.205M. Run 5 takes test Sharpe from 3.82 to 5.65. In finance, large multipliers have a way of turning otherwise normal adults into PowerPoint poets.

But the more interesting pattern is comparative.

Run 2, which evolves only order placement while keeping the target computation fixed, delivers the largest absolute out-of-sample PnL. That suggests that under this simulator and cost model, execution logic is a high-leverage optimization target. The system discovers better ways to price passive orders, condition depth on signal quality, and handle adverse-selection risk.

Run 5, which evolves both features and execution, delivers the highest out-of-sample Sharpe. It does not beat Run 2 in absolute PnL because it trades less volume and runs a smaller book. In business terms, Run 2 looks like the throughput winner; Run 5 looks like the risk-adjusted quality winner.

That distinction is important. A firm optimizing a market-making system, a treasury execution engine, or a portfolio rebalancer may care more about execution throughput. A firm optimizing capital allocation under risk limits may care more about Sharpe-like quality. “Best” is not a universal adjective. It is a governance decision pretending to be a metric.

The system did not merely scale up trade size

A standard suspicion with PnL-optimized trading systems is that the optimizer simply learned to trade larger. If the baseline makes money, scaling the position can make PnL look better until risk or impact catches up.

The paper treats this as a serious problem rather than hiding it behind a larger font. It uses three checks.

First, it compares evolved strategies against a sizing-only counterfactual: what would the baseline have earned if its trades were uniformly scaled to match the evolved volume? The evolved strategies beat this counterfactual, meaning pure size increase does not explain the results.
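
The counterfactual is easy to state in code. The sketch below assumes gross PnL and fees scale linearly with trade size while impact scales as a power of the size multiplier; the exponent of 1.5 and every number in the usage note are hypothetical, not values from the paper.

```python
def scaled_baseline_pnl(gross_pnl, fees, impact, baseline_volume, target_volume,
                        impact_exponent=1.5):
    """Impact-adjusted PnL of the baseline if every trade were multiplied by one constant."""
    k = target_volume / baseline_volume  # uniform size multiplier
    return k * gross_pnl - k * fees - (k ** impact_exponent) * impact
```

With made-up inputs of 100 gross, 10 in fees, and 5 in impact, quadrupling volume gives `scaled_baseline_pnl(100.0, 10.0, 5.0, 1.0, 4.0)`, which is 320 rather than the 340 that linear scaling would suggest. An evolved strategy only earns credit for what it delivers above that number.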

Second, it reports Sharpe improvements. Linear scaling does not improve Sharpe. If all trades are simply multiplied by the same constant, both return and volatility scale together. With super-linear impact costs, blind scaling should make Sharpe worse, not better.

Third, it reports Calmar improvements. Calmar ratio also resists the “we just took more risk” explanation because it compares annualized return with maximum drawdown.
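
The scale-invariance argument can be checked directly. This sketch uses un-annualized versions of both ratios on a made-up PnL series and ignores impact costs, so it shows only the pure-scaling case: multiplying every PnL entry by the same positive constant leaves Sharpe and Calmar unchanged.

```python
import statistics

def sharpe(pnl):
    """Mean over standard deviation of per-period PnL (not annualized)."""
    return statistics.mean(pnl) / statistics.stdev(pnl)

def max_drawdown(pnl):
    """Largest peak-to-trough drop of the cumulative PnL curve."""
    equity = peak = worst = 0.0
    for x in pnl:
        equity += x
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst

def calmar(pnl):
    """Total PnL over maximum drawdown (not annualized)."""
    return sum(pnl) / max_drawdown(pnl)

daily_pnl = [0.5, -0.25, 1.0, -0.5, 0.75]   # hypothetical series
scaled = [4.0 * x for x in daily_pnl]       # same trades, four times the size
```

Here `sharpe(daily_pnl)` equals `sharpe(scaled)` and both Calmar values are 3.0. An improvement in either ratio therefore has to come from something other than a bigger multiplier.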

The qualitative code changes reinforce the point. The evolved strategies introduce stateful signal processing, inventory-aware no-trade bands, conviction-dependent sizing, regime-conditional risk reduction, and adaptive quoting depth. These are structural changes, not one global multiplier wearing a lab coat.

For business readers, this is a useful diagnostic template. Any agentic optimization system that improves an operational metric should be tested against boring counterfactuals. Did it improve the process, or did it just increase intensity? Did the sales agent convert better, or did it send five times as many messages? Did the warehouse optimizer reduce cost, or did it defer work into a hidden queue? Did the trading bot improve execution, or did it just lean harder into risk?

The cheap explanation should be killed before the expensive one is celebrated.

Forecast features improve, but alpha still needs an executor that understands it

Run 4 evolves the feature set feeding a ridge regression alpha model. This run is not scored by trading PnL. It optimizes a composite of prediction-quality metrics: out-of-sample $R^2$ at a 10-minute horizon, mean daily information coefficient, and ICIR.
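
For readers who have not met the last two metrics, here is a minimal sketch of a daily information coefficient and its ICIR. It assumes a rank (Spearman-style) correlation with ties broken by position, which may differ from the paper's exact definition, and it does not attempt to reproduce how the three metrics are weighted into the composite.

```python
import statistics

def _ranks(values):
    """Rank positions of each value (ties broken by order; fine for continuous data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = float(rank)
    return ranks

def _pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def daily_ic(forecasts, realized):
    """Rank correlation between one day's forecasts and realized returns."""
    return _pearson(_ranks(forecasts), _ranks(realized))

def ic_summary(days):
    """Return (mean daily IC, ICIR), where ICIR is mean IC over its standard deviation."""
    ics = [daily_ic(f, r) for f, r in days]
    mean_ic = statistics.mean(ics)
    return mean_ic, mean_ic / statistics.stdev(ics)
```

ICIR rewards consistency: a forecaster with a modest but steady daily IC can outrank one with a higher but erratic IC.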

The evolved feature set expands the baseline from three EMA-return features to 77 features. These include multi-horizon returns, volatility ratios, volume pressure, candle-shape statistics, Donchian/Bollinger/RSI/MACD variants, VWAP deviations, mean-reversion terms, choppiness indicators, time-of-day features, and interactions.

The results improve across validation and test:

| Metric | Baseline | Evolved | Meaning |
| --- | --- | --- | --- |
| Combined score | 0.0848 | 0.1281 | 1.51x improvement on the validation fitness signal |
| Features | 3 | 77 | Much richer feature library |
| Validation $R^2$ at 10-min horizon | 0.0021 | 0.0043 | Roughly doubled |
| Validation mean daily IC | 0.0736 | 0.1100 | Better rank association |
| Test $R^2$ at 10-min horizon | 0.0017 | 0.0034 | Test improvement persists |
| Test mean daily IC | 0.0592 | 0.0989 | Test IC improves |

This is the main evidence for feature-engineering capability, not direct evidence of a better live trading system.

The paper is careful about that distinction. A better forecast metric does not automatically translate into better trading PnL. The executor was calibrated for the old alpha’s scale and noise profile. Change the alpha, and the same sizing constants can become wrong.

The hyperparameter calibration experiment makes this point sharply. With default execution parameters, the evolved forecaster looks better on validation but worse on the held-out test period: test impact-adjusted PnL drops from $46,791 for the baseline forecaster to $27,842 for the evolved forecaster. After recalibrating eight execution parameters with Optuna, the evolved forecaster becomes economically better: test impact-adjusted PnL rises to $159,967, compared with $103,089 for the recalibrated baseline.
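
The recalibration step can be illustrated with a toy. The paper uses Optuna over eight execution parameters; the sketch below swaps in a plain exhaustive grid search over two hypothetical parameters, `threshold` and `gain`, and a made-up validation score whose optimum moves when the alpha's scale changes. None of the numbers come from the paper.

```python
import itertools

def recalibrate(score, grid):
    """Exhaustive grid search; returns (best_params, best_score) on the given score."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        s = score(params)
        if best is None or s > best[1]:
            best = (params, s)
    return best

def toy_validation_pnl(params, alpha_scale):
    """Made-up score: the best threshold tracks the alpha scale, the best gain its inverse."""
    return -(params["threshold"] - alpha_scale) ** 2 - (params["gain"] * alpha_scale - 1.0) ** 2

GRID = {"threshold": [0.5, 1.0, 2.0, 4.0], "gain": [0.25, 0.5, 1.0, 2.0]}
old_params = {"threshold": 1.0, "gain": 1.0}   # tuned for an alpha with scale 1
stale_score = toy_validation_pnl(old_params, alpha_scale=2.0)
new_params, new_score = recalibrate(lambda p: toy_validation_pnl(p, 2.0), GRID)
```

In this toy, the stale parameters score -2.0 on the rescaled alpha, while the recalibrated ones recover the optimum. As in the paper, the search should see only validation data, with the test period reserved for the final comparison.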

This is one of the most transferable findings in the paper. In many businesses, improving an upstream model can damage downstream performance if the rest of the system expects the old signal distribution. A better fraud score may break manual review queues. A better demand forecast may destabilize inventory rules. A better lead score may overload sales capacity. A better trading alpha may lose money because the executor still thinks it is talking to yesterday’s alpha.

The model improved. The system was miscalibrated. Both statements can be true, which is why mature organizations test pipelines rather than components in isolation.

Joint optimization is powerful because components can co-adapt

Run 5 exposes both the feature-generation function and the full execution pipeline to evolution. Each candidate rebuilds the feature matrix, refits the ridge model on the 2022–2023 training split, feeds the resulting alpha into the evolved execution strategy, and scores the full pipeline on 2024 impact-adjusted PnL.

This is the full co-design setting. The feature pipeline can evolve toward signals that the execution strategy can exploit, and the execution strategy can evolve toward the signal’s scale, decay, and reliability.

The result is the highest Sharpe in the paper: 5.65 out of sample, compared with 3.82 for the baseline. Test impact-adjusted PnL rises from $46,791 to $724,217. The evolved system uses 79 features and trades about $3.6B in test volume.

The mechanism is not “more features equals better.” The appendix summaries show a more disciplined pattern: clipped feature families, multi-scale momentum, volatility-adjusted signals, volume-pressure terms, and execution logic where sizing and pricing are monotone in signal conviction and inventory utilization. In plain language, strong signals get more decisive action, inventory pressure widens the caution zone, and risk-off behavior changes order depth.

That is what co-adaptation looks like. The signal and the executor learn to speak the same dialect.

But Run 5 also has the strongest overfitting signature. Its validation-to-test PnL retention is roughly 39%, compared with about 51% for Run 1 and 54% for Run 2. Its win rate drops from 68.6% in validation to 49.8% in test. The test result is still strong, but the generalization gap is larger.
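
The retention figures follow directly from the rounded PnL values in the results table. A quick check, where retention is simply test PnL divided by validation PnL:

```python
# (validation, test) impact-adjusted PnL in $K, rounded as reported in the results table.
RESULTS_K = {
    "Run 1": (533, 271),
    "Run 2": (2238, 1205),
    "Run 3": (973, 473),
    "Run 5": (1855, 724),
}

retention = {run: test / validation for run, (validation, test) in RESULTS_K.items()}

for run, ratio in retention.items():
    print(f"{run}: {ratio:.0%}")
# Run 1: 51%
# Run 2: 54%
# Run 3: 49%
# Run 5: 39%
```

Run 5 retains the smallest share of its validation PnL, which is the generalization gap described above.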

This is the right interpretation: joint optimization can find higher-quality systems, but it also increases the need for stricter validation. More degrees of freedom make real discoveries and backtest artifacts grow from the same soil. Fertilizer is not a governance policy.

The paper’s safeguards are part of the contribution

The paper includes several tests and diagnostics that should be read as part of the method, not as appendix decoration.

| Test or diagnostic | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Held-out 2025 test split | Main generalization evidence | Evolved candidates are not only winning on 2024 validation | Live exchange profitability |
| Sizing-only counterfactual | Robustness check against trivial scale-up | PnL gains exceed what uniform trade scaling explains | That all execution assumptions are realistic |
| Sharpe and Calmar checks | Scale-invariant robustness checks | Improvements are not merely larger position size | Strategy capacity under real liquidity |
| Impact-adjusted PnL | Implementation safeguard | Fitness penalizes aggressive volume through modeled impact | Actual realized slippage on a specific venue |
| Parameter budget | Overfitting-control design choice | Limits free-parameter creep in evolved code | Complete protection from search overfitting |
| IS–OOS trajectory tracking | P-hacking diagnostic | OOS improves alongside IS rather than staying flat | Immunity to future regime change |
| Baseline-shifted p-hacking null | Statistical sanity check | Results are unlikely under a simple best-of-$N$ noise story | That the simulator itself matches the real market |
| Claude Code comparison | Exploratory comparison | Structured evolution is not the only viable agentic search pattern | A definitive framework ranking |

This table is where the business interpretation should slow down.

The p-hacking analysis is not a magic absolution. It addresses a specific question: are the observed gains likely to come from repeatedly sampling noisy variants around the baseline and selecting the luckiest one? The paper argues no, because out-of-sample performance improves and because the final results are far above a baseline-shifted best-of-$N$ null.
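
The intuition behind a best-of-$N$ null can be sketched with a small simulation. This is a generic illustration, not the paper's exact construction: it assumes every candidate's score is the baseline plus independent Gaussian noise, and the noise level and candidate count are placeholders that a real analysis would have to estimate.

```python
import random

def best_of_n_null(baseline, noise_sd, n_candidates, n_trials=2000, seed=0):
    """Simulate the best score among n_candidates variants that have no real edge."""
    rng = random.Random(seed)
    return [
        max(baseline + rng.gauss(0.0, noise_sd) for _ in range(n_candidates))
        for _ in range(n_trials)
    ]

def p_value(observed, null_maxima):
    """Share of null trials whose luckiest candidate matches or beats the observed score."""
    return sum(m >= observed for m in null_maxima) / len(null_maxima)
```

Even with no real edge, the luckiest of many candidates sits well above the baseline, so the observed winner has to clear that inflated bar rather than the baseline itself.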

That is meaningful. It does not eliminate the deeper backtest-overfitting problem: the same simulator assumptions are used for validation and test. If those assumptions are wrong, the system can generalize within the simulated world and still fail outside it. The paper explicitly names several assumptions: that the aggregated data contain a realizable signal, that the fill model is adequate, and that the fee and market-impact structure is representative.

So the correct conclusion is not “the strategy is proven.” The correct conclusion is “the search process is not obviously just p-hacking under the paper’s simulator.” That is less glamorous. It is also far more useful.

Claude Code performs surprisingly well, but prompt sensitivity matters

The paper also compares MadEvolve with a Claude Code-style iterative agent. This is an exploratory extension, not the main evidence.

In the strategy-search experiment, Claude Sonnet 4.6 starts from the same initial strategy, receives an ideas tree of prior mutations and scores, and iteratively modifies either target-position logic, limit-order logic, or both. Out of 200 proposed mutations, 170 are evaluated successfully. The best Claude-discovered strategy reaches $583,783 validation impact-adjusted PnL and $340,326 test impact-adjusted PnL.

That is comparable to some MadEvolve runs, though below the strongest order-placement result. The search improves in phases: early gains from target-position changes, a large jump from convex power-law sizing, then incremental refinements in sizing and order placement.

The feature-search comparison is even more interesting. Claude evolves a feature-generation function and uses additional regularization: forward feature selection, a cap of 10 selected features, and a pairwise-correlation filter. Test forecast metrics improve strongly, and after recalibrating execution parameters, the Claude-evolved forecaster reaches $235,419 test impact-adjusted PnL with Sharpe 5.27. That beats the MadEvolve evolved-forecaster calibration result reported in the paper’s Section 5.7.
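
The regularization described here combines three familiar pieces: greedy forward selection, a hard cap on the number of features, and a pairwise-correlation filter. A minimal sketch follows. The cap of 10 comes from the text, while the correlation threshold of 0.9, the function names, and the caller-supplied `score` function are assumptions rather than the authors' settings.

```python
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denom = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return cov / denom if denom else 0.0  # constant columns count as uncorrelated

def forward_select(features, score, max_features=10, max_abs_corr=0.9):
    """Greedily add the feature that most improves `score`, skipping near-duplicates."""
    selected, best = [], float("-inf")
    while len(selected) < max_features:
        candidates = [
            name for name in features
            if name not in selected
            and all(abs(pearson(features[name], features[s])) <= max_abs_corr
                    for s in selected)
        ]
        scored = [(score(selected + [name]), name) for name in candidates]
        if not scored:
            break
        top_score, top_name = max(scored)
        if top_score <= best:
            break  # no remaining candidate improves the validation score
        selected.append(top_name)
        best = top_score
    return selected
```

In practice `score` would refit the forecaster on the training split and return a validation metric, so the selection never sees the test period.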

But the authors also report high prompt sensitivity. A slightly modified Claude workflow produced only modest improvements in another run. That matters. Agentic systems are still systems made of brittle protocols, not little researchers wearing invisible lab badges.

The business takeaway is not “Claude Code is better” or “MadEvolve is better.” The paper does not establish that. The better takeaway is that agentic search performance depends heavily on scaffolding: memory structure, candidate retention, mutation constraints, regularization, evaluation feedback, and prompt format.

In production, the agent is not the product. The loop is the product.

The business value is governed search, not autonomous trading

For Cognaptus readers, the practical lesson travels beyond crypto.

MadEvolve shows a pattern for using LLMs in noisy optimization problems:

| Design element | Trading example | Business analogue |
| --- | --- | --- |
| Fixed evaluation harness | Backtester and PnL accounting cannot be mutated | Immutable KPI calculation and test protocol |
| Mutable code blocks | Only target sizing, order placement, or features evolve | Controlled process modules open to AI redesign |
| Cost-aware objective | Impact-adjusted PnL penalizes aggressive trading | Reward includes operational cost, risk, latency, or failure cost |
| Population diversity | MAP-Elites and islands preserve varied candidates | Avoid converging too early on one local process design |
| Held-out evaluation | 2025 test split not used for selection | Future period, unseen customers, unseen stores, unseen workflows |
| Trivial-counterfactual checks | Pure sizing baseline | “Did we improve quality or just increase activity?” |
| Parameter budgets | Cap tunable constants | Prevent agents from hiding overfitting in knobs |
| Human-readable reports | Summaries of evolved strategy logic | Audit trail for why the system changed |

This is useful for trading desks, but also for any company optimizing business logic through simulation: pricing rules, inventory policies, routing logic, collections workflows, marketing allocation, credit triage, energy dispatch, customer-support escalation, or fraud review.

The common structure is the same. There is a process. The process can be represented as code. The company has historical data or a simulator. The objective is measurable but noisy. The search space is too large for manual tuning. An LLM can propose structurally meaningful variants, but it must not be trusted to grade itself.

That last sentence is the entire governance model.

Where the paper should not be overread

The paper’s strongest claims are conditional. That does not weaken the work; it tells us how to use it.

First, the market data are aggregated BTCUSD minute bars, not a venue-specific order book. A strategy that fills inside candle ranges in a simulator may behave differently when placed into a real matching engine with queue position, latency, partial fills, adverse selection, and exchange-specific fee tiers.

Second, the impact model is calibrated and conservative in some respects, but it is still a model. Real execution costs can change with volatility, venue liquidity, crowding, and the strategy’s own footprint. A system optimized against one impact model may learn the model’s shape.

Third, the out-of-sample test is chronological and useful, but it is still part of the same historical environment and same simulator. It helps distinguish validation overfitting from generalization within the paper’s setup. It does not guarantee robustness to future regimes.

Fourth, the feature results may partly exploit artifacts in aggregated candle data. The authors explicitly note that some high values may reflect artificial structures in multi-exchange aggregated data. That is not a minor footnote for live trading. It is the sort of thing that separates research from expensive performance art.

Fifth, the comparison across LLM models will age quickly. The durable point is that ensemble diversity helped; the specific model ranking is not a law of nature. In AI, the half-life of a leaderboard can be shorter than a poorly managed options trade.

What this paper changes about AI trading workflows

The old workflow for systematic trading research is painfully familiar: researchers write ideas, backtest them, tune parameters, debate whether the result is real, and slowly accumulate scars.

The naive AI workflow replaces the researcher with an LLM and hopes the scars become someone else’s problem.

The better workflow, illustrated by this paper, is different:

  1. Define a simulator whose assumptions are explicit.
  2. Freeze the evaluation harness.
  3. Expose only selected modules to mutation.
  4. Let LLMs generate candidate code.
  5. Score candidates automatically.
  6. Preserve diversity, not just the latest winner.
  7. Test held-out performance.
  8. Attack cheap explanations.
  9. Calibrate downstream components when upstream signals change.
  10. Treat live deployment as a separate validation problem.

This is not glamorous. It is a factory.

But in noisy domains, factories beat oracles. A factory can be inspected, stress-tested, improved, and audited. An oracle mostly gives you confidence, which is convenient because confidence is cheaper than risk control.

The paper’s contribution is therefore not that AI has solved trading. It has not. The contribution is that LLM-driven evolutionary search can improve trading-system components inside a disciplined experimental scaffold, and that the scaffold itself is the serious part of the invention.

For business leaders, that is the useful lesson. Do not ask whether an LLM can make decisions. Ask whether your organization can build the harness that makes its proposed decisions testable, rejectable, and expensive to fake.

Because once the order is placed, the market does not care how clever the prompt was.

Cognaptus: Automate the Present, Incubate the Future.


  1. Yurii Kvasiuk, Tianyi Li, Owen Colegrove, and Moritz Münchmeyer, “MadEvolve: Evolutionary Optimization of Trading Systems with Large Language Models,” arXiv:2605.23007v1, 21 May 2026.