TL;DR for operators
Portfolio teams do not usually fail because they have no models. They fail because the models age, the signals decay, and the process of discovering new sparse selection logic is slow, expensive, and wonderfully allergic to market regime shifts.
The paper behind EFS (Evolutionary Factor Searching) proposes a useful change in framing: stop asking the LLM to “pick stocks” and ask it to generate executable alpha-factor formulas that can be backtested, filtered, evolved, and used to rank assets under sparse portfolio constraints [1]. That distinction matters. The LLM is not the portfolio manager. It is the factor-factory intern with suspicious stamina. The backtest loop is still the adult in the room.
Operationally, EFS is interesting because it connects four pieces that are often separated: interpretable factor generation, evolutionary search, sparse top-$m$ asset selection, and rolling performance feedback. The authors test it on five Fama-French benchmark datasets and three real-market datasets — US50, HSI45, and CSI300 — and report strong performance against equal-weight, classical optimisation, machine-learning selectors, and recent sparse-portfolio baselines.
The strongest business reading is not “LLMs can trade now”. Please, let us keep one foot on the floor. The better reading is that LLMs may become useful research automation systems for expanding, mutating, and stress-testing factor libraries. The unresolved part is production realism: turnover, transaction costs, concentration risk, online monitoring, prompt failures, and live-market validation. EFS makes factor research cheaper to iterate. It does not repeal market microstructure, because markets have not been that generous lately.
The problem is not “which stocks should the AI buy?”
The lazy version of this story is simple: an LLM looks at markets, chooses assets, beats baselines, and finance is now solved. Conveniently, that version is also wrong.
EFS does something narrower and more technically useful. It converts sparse portfolio optimisation into a factor-guided ranking problem. Instead of solving directly for a portfolio weight vector under an $\ell_0$ constraint, the system searches for formulas that score each asset. Those scores are aggregated, assets are ranked, and the top $m$ are selected.
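The original sparse portfolio problem is usually written along the following lines, where $f(w)$ stands in for the chosen objective (for example, expected return or a risk-adjusted variant) and the budget and long-only constraints are the common textbook choice:

$$
\max_{w \in \mathbb{R}^N} \; f(w) \quad \text{s.t.} \quad \mathbf{1}^\top w = 1, \quad w \ge 0, \quad \|w\|_0 \le m
$$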
That last constraint, $\|w\|_0 \le m$, is the sting. It says the portfolio can hold at most $m$ assets. This is not just a neat mathematical inconvenience. It reflects real operational constraints: limited capital, transaction costs, risk concentration, explainability requirements, and the old-fashioned desire not to own 137 tiny positions because an optimiser had a poetic afternoon.
Classical sparse optimisation methods try to solve or approximate that constrained problem directly. EFS sidesteps part of the combinatorial pain by asking a different question: can we generate better ranking functions for identifying the few assets worth selecting?
That replacement is the paper’s central mechanism. It is also the main protection against misunderstanding. EFS is not a foundation model staring into the market and receiving a revelation. It is a controlled loop for creating, testing, and pruning factor expressions.
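The score, aggregate, rank, select step can be sketched in a few lines. This is a minimal illustration, assuming factor scores arrive as a NumPy array with one row per factor and one column per asset. The function name `select_top_m` and the z-score averaging are assumptions for the sketch, not the paper's exact aggregation rule.

```python
import numpy as np

def select_top_m(factor_scores: np.ndarray, m: int) -> np.ndarray:
    """Return equal weights over the top-m assets.

    factor_scores has shape (n_factors, n_assets). Hypothetical helper.
    """
    n_assets = factor_scores.shape[1]
    m = min(m, n_assets)
    # Z-score each factor across assets so factors on different scales are comparable.
    mu = factor_scores.mean(axis=1, keepdims=True)
    sd = factor_scores.std(axis=1, keepdims=True)
    z = (factor_scores - mu) / np.where(sd > 0, sd, 1.0)
    # Aggregate by simple averaging (an assumption; other schemes are possible).
    combined = z.mean(axis=0)
    # Rank descending and keep the m highest-scoring assets.
    top = np.argsort(-combined, kind="stable")[:m]
    weights = np.zeros(n_assets)
    weights[top] = 1.0 / m
    return weights
```

The cardinality constraint is satisfied by construction: at most `m` entries of the returned weight vector are non-zero, so no combinatorial solver is needed.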
EFS works because the LLM is boxed in
The system begins with a small seed library of basic price-based factors. These include familiar components such as mean return, volatility, momentum, maximum drawdown, Sharpe-style ratios, moving averages, Bollinger Band width, exponential moving-average ratios, and RSI. This is not a giant proprietary factor zoo. It is deliberately modest.
That matters because the authors want the observed gains to come from evolution, not from smuggling in a huge pre-engineered library and then letting the LLM take credit at the press conference.
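For a sense of how modest such seeds are, here is a sketch of three of them as NumPy functions. It assumes closing prices in an array of shape `(T, N)` (time by assets) and returns one score per asset. The function names, signatures, and default windows are illustrative assumptions, not the paper's code.

```python
import numpy as np

def momentum(prices: np.ndarray, window: int = 20) -> np.ndarray:
    # Simple return over the last `window` steps, one value per asset.
    return prices[-1] / prices[-window - 1] - 1.0

def volatility(prices: np.ndarray, window: int = 20) -> np.ndarray:
    # Standard deviation of the last `window` simple returns.
    rets = prices[1:] / prices[:-1] - 1.0
    return rets[-window:].std(axis=0)

def max_drawdown(prices: np.ndarray, window: int = 60) -> np.ndarray:
    # Largest peak-to-trough loss within the last `window` prices.
    p = prices[-window:]
    running_max = np.maximum.accumulate(p, axis=0)
    return (1.0 - p / running_max).max(axis=0)
```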
The mechanism has two stages.
First, EFS warms up the factor library. Initial factors are evaluated over a look-back window, producing performance statistics such as RankIC, cumulative wealth, Sharpe ratio, and related quality measures. These statistics become feedback for later generation.
Second, the system enters an iterative loop. At each search interval (weekly, in the paper’s description), EFS selects top-performing factors from the current pool and constructs prompts for the LLM. The prompt does not reveal raw tickers, price series, dates, or market phases. It provides factor definitions and anonymised performance summaries. The LLM then produces new candidate Python functions through structured mutation and crossover.
Mutation means modifying an existing factor: changing parameters, adjusting logic, or altering internal operators. Crossover means combining two factor structures into a new expression. The result is not a paragraph of investment advice. It is executable code.
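To make the two operators concrete, here is a hand-written illustration of what a mutation and a crossover might look like. These are not outputs from the paper's LLM. The parent factors, the `skip` parameter, and all names are assumptions chosen for the example, with prices in an array of shape `(T, N)`.

```python
import numpy as np

def parent_momentum(prices: np.ndarray, window: int = 20) -> np.ndarray:
    # Parent A: simple return over the last `window` steps.
    return prices[-1] / prices[-window - 1] - 1.0

def parent_volatility(prices: np.ndarray, window: int = 20) -> np.ndarray:
    # Parent B: standard deviation of the last `window` simple returns.
    rets = prices[1:] / prices[:-1] - 1.0
    return rets[-window:].std(axis=0)

def mutated_momentum(prices: np.ndarray, window: int = 10, skip: int = 2) -> np.ndarray:
    # Mutation of parent A: shorter window, and ignore the most recent `skip` steps.
    return prices[-1 - skip] / prices[-1 - skip - window] - 1.0

def crossover_vol_adjusted_momentum(
    prices: np.ndarray, window: int = 20, eps: float = 1e-8
) -> np.ndarray:
    # Crossover of A and B: momentum scaled by volatility; eps guards against division by zero.
    return parent_momentum(prices, window) / (parent_volatility(prices, window) + eps)
```

Both children remain ordinary functions of prices, which is what lets the same backtest harness score parents and offspring alike.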
Then comes the important part: EFS filters the generated candidates. It checks syntax, validates the factors, keeps high-quality and diverse variants, prunes weak ones, and updates the factor pool. Only after that does the system use the surviving factors to score assets and construct the sparse portfolio.
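A filter of this kind might look like the sketch below. It covers only two of the checks mentioned above, runtime validity and redundancy, and leaves performance-based pruning to the backtest. The correlation threshold, the function names, and the shape convention `(T, N)` for prices are assumptions, not the paper's settings.

```python
import numpy as np

def is_valid(factor, prices: np.ndarray) -> bool:
    """Run a candidate on sample data; require one finite score per asset."""
    try:
        scores = np.asarray(factor(prices), dtype=float)
    except Exception:
        return False  # any runtime failure disqualifies the candidate
    return scores.shape == (prices.shape[1],) and bool(np.all(np.isfinite(scores)))

def filter_candidates(candidates, pool_scores, prices, max_corr: float = 0.95):
    """Keep valid candidates that are not near-duplicates of already accepted scores."""
    kept = []
    seen = [np.asarray(s, dtype=float) for s in pool_scores]
    for factor in candidates:
        if not is_valid(factor, prices):
            continue
        scores = np.asarray(factor(prices), dtype=float)
        if scores.std() == 0:
            continue  # a constant score cannot rank assets
        redundant = any(abs(np.corrcoef(scores, s)[0, 1]) > max_corr for s in seen)
        if not redundant:
            kept.append(factor)
            seen.append(scores)
    return kept
```

Candidates are compared against the existing pool and against each other in arrival order, so the first of two near-identical variants survives.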
A simplified view:
| Step | What happens | Why it matters |
|---|---|---|
| Seed | Start from basic technical and statistical factors | Gives the LLM a sane search space rather than a blank canvas, which is generally where nonsense goes to breed |
| Evaluate | Backtest factors using ranking and portfolio metrics | Turns generation into selection, not vibes |
| Prompt | Feed top factors and performance summaries to the LLM | Lets the model mutate from evidence rather than invent from nothing |
| Generate | Produce candidate Python factor functions | Makes the output executable and inspectable |
| Filter | Remove invalid, redundant, or weak factors | Keeps the factor pool from becoming a landfill |
| Rank | Aggregate factor scores and select top-$m$ assets | Converts factor discovery into sparse portfolio construction |
| Repeat | Roll forward and evolve | Allows adaptation as market regimes shift |
The “LLM” part is therefore not magic. It is structured search with a language model acting as a code-generating heuristic engine. That is less glamorous than “AI trader”, but substantially more useful.
Ranking is the point, not return prediction
A lot of finance machine learning tries to predict returns. That sounds natural, and it is often operationally treacherous. Small errors in expected returns can produce large allocation errors, especially under constraints. Sparse portfolios make the problem nastier because the system must distinguish the best few assets, not merely estimate an average tendency across a broad universe.
EFS focuses on ranking metrics such as RankIC. RankIC measures the Spearman rank correlation between factor scores at time $t$ and realised returns at time $t+1$. In plain English: did the factor rank future winners above future losers?
That is a sensible metric for sparse selection. If the portfolio only holds the top $m$ assets, the factor’s ability to separate the best candidates matters more than its ability to predict exact return magnitudes. A factor can be poorly calibrated in absolute terms and still useful if its ordering is good. Conversely, a smooth return model can look statistically elegant while failing to identify the top tail. Finance has many such elegant disappointments.
The paper’s use of RankIC and RankICIR also makes the evolutionary loop more targeted. The LLM is not being rewarded for producing pretty formulas. It is being guided by whether those formulas improve the ordering of assets under portfolio-relevant constraints.
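Both metrics are short to compute. The sketch below uses NumPy only and breaks ties by position rather than averaging them, which is a simplification. RankICIR is taken here as the mean of the RankIC series divided by its sample standard deviation, the usual information-ratio form; the paper's exact convention is not confirmed.

```python
import numpy as np

def _ranks(x) -> np.ndarray:
    # Ordinal ranks 0..n-1; ties are broken by position (a simplification).
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(len(x))
    return ranks

def rank_ic(scores, next_returns) -> float:
    # Spearman correlation = Pearson correlation of the ranks.
    return float(np.corrcoef(_ranks(scores), _ranks(next_returns))[0, 1])

def rank_icir(ic_series) -> float:
    # Mean RankIC divided by its sample standard deviation.
    ic = np.asarray(ic_series, dtype=float)
    sd = ic.std(ddof=1)
    return float(ic.mean() / sd) if sd > 0 else float("nan")
```

A factor that orders every asset correctly scores a RankIC of 1 regardless of how badly it misjudges the size of the returns, which is exactly the property a top-$m$ selector cares about.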
The main evidence is broad, but not all evidence plays the same role
The paper includes several layers of evidence. They should not be treated as one undifferentiated blob of “results”, because that is how nuance gets mugged in an alley.
| Evidence item | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Fama-French benchmark tests | Main evidence | EFS performs strongly across standard academic portfolio datasets | That the method will survive live trading frictions |
| US50, HSI45, CSI300 tests | Main evidence plus practical extension | EFS works on daily real-market asset pools across different regions and regimes | That three universes are enough to establish generality across markets |
| Prompt ablations | Ablation | Performance feedback, numeric metrics, sparse heuristics, and TA seeds matter | That the prompt is optimal |
| Candidate generation volume tests | Sensitivity test | More generated factors per step can dilute quality and reduce evolutionary pressure | That one fixed generation size is best everywhere |
| Transaction-cost tests | Robustness / realism check | Costs reduce returns materially, but EFS remains competitive in several settings | That turnover is solved |
| Score-to-weight experiments | Exploratory extension | Using factor scores for weighting can boost returns but may create concentration and tail risk | That aggressive weighting is deployable |
This distinction is important. The benchmark tables support the main claim that EFS is a competitive sparse portfolio method in backtests. The ablations support the mechanism: feedback matters. The transaction-cost and weighting tests are not a victory lap; they are where the method starts negotiating with reality.
Reality, being petty, asks for turnover control.
The headline results are strong, especially in sparse real-market tests
On the five Fama-French datasets — FF25, FF32, FF49, FF100, and FF100MEOP — the paper reports that EFS outperforms the tested baselines under the standard sparse setting. The advantage becomes more visible in larger asset universes, which is precisely where sparse selection becomes more painful.
The real-market results are more concrete for operators. The authors evaluate US50, HSI45, and CSI300. In the main real-market table, under $m=10$, EFS beats the listed baselines on cumulative wealth and Sharpe ratio across the three datasets, though drawdown comparisons vary by market and method.
A compressed view of the reported $m=10$ real-market table (CW = cumulative wealth, SR = Sharpe ratio, MDD = maximum drawdown):
| Dataset | Baseline 1/N CW / SR / MDD | Strongest EFS CW / SR / MDD in main table | Practical reading |
|---|---|---|---|
| US50 | 4.562 / 0.072 / 0.344 | EFS-DeepSeek: 25.101 / 0.132 / 0.288; EFS-GPT: 22.905 / 0.130 / 0.260 | Large return improvement with better drawdown than equal weight |
| HSI45 | 1.333 / 0.029 / 0.409 | EFS-DeepSeek: 3.463 / 0.080 / 0.385; EFS-GPT: 2.789 / 0.067 / 0.292 | Stronger Sharpe and materially better drawdown for GPT variant |
| CSI300 | 1.087 / 0.014 / 0.214 | EFS-GPT: 4.962 / 0.098 / 0.301; EFS-DeepSeek: 3.437 / 0.079 / 0.327 | Strong return and Sharpe gains, but drawdown is higher than 1/N |
That last row deserves attention. EFS improves cumulative wealth and Sharpe on CSI300, but its maximum drawdown is not lower than equal weighting in the main table. This does not invalidate the result. It prevents the wrong interpretation. EFS is not simply “higher return, lower risk, everywhere, always”. It is a ranking and selection system that can generate stronger performance, sometimes with more drawdown depending on market and configuration.
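For readers who want to reproduce the three columns on their own return series, here is a minimal sketch. It assumes per-period simple returns, a zero risk-free rate, and no annualisation. Those conventions are assumptions for illustration and may not match the paper's exact definitions.

```python
import numpy as np

def cumulative_wealth(returns) -> float:
    # Final wealth from an initial stake of 1.
    return float(np.prod(1.0 + np.asarray(returns, dtype=float)))

def sharpe_ratio(returns) -> float:
    # Per-period mean over sample standard deviation (no annualisation).
    r = np.asarray(returns, dtype=float)
    sd = r.std(ddof=1)
    return float(r.mean() / sd) if sd > 0 else float("nan")

def max_drawdown(returns) -> float:
    # Largest fractional drop from a running peak, starting from wealth 1.
    r = np.asarray(returns, dtype=float)
    wealth = np.concatenate(([1.0], np.cumprod(1.0 + r)))
    peak = np.maximum.accumulate(wealth)
    return float(np.max(1.0 - wealth / peak))
```

Computing all three on the same series makes the CSI300 pattern easy to see in one's own data: a strategy can raise CW and SR while its MDD gets worse.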
The authors also report repeated-run and transaction-cost variants in the appendix. In the no-cost repeated-run setting under $m=10$, EFS-GPT reaches CW of 39.746 on US50, 3.338 on HSI45, and 3.862 on CSI300, with corresponding Sharpe ratios of 0.154, 0.076, and 0.086. Those numbers are impressive, but the variance is not decorative. US50’s GPT result is reported as $39.746 \pm 15.484$, which means stochastic search variation is part of the operating profile.
That is not a fatal weakness. It is a deployment requirement. Any desk using this kind of system would need search aggregation, stability filters, and run-to-run monitoring. Conveniently, the paper already recognises this by pooling results across multiple runs and discussing aggregation as a way to smooth LLM output variability.
The ablations show that feedback is the engine
The ablation results are among the most useful parts of the paper because they test whether EFS works for the reason the authors claim.
The short answer: yes, mostly.
Using DeepSeek with three repeated trials, the authors remove components from the prompt and initial factor setup. The most damaging ablations are the ones that remove performance feedback or remove the technical-analysis seed factors.
On US50, the full EFS-DeepSeek setup reports CW of $32.993 \pm 6.044$, SR of $0.149 \pm 0.003$, and RankIC of $0.027 \pm 0.001$. Remove performance feedback, and CW drops to $9.549 \pm 5.869$, SR to $0.094 \pm 0.019$, and RankIC to $0.009 \pm 0.009$. Remove TA factors, and CW falls further to $5.367 \pm 1.652$, with RankIC near zero.
That is the mechanism in numerical form. The LLM is not valuable because it is generative in the abstract. It is valuable because generation is conditioned on recent evidence and constrained by usable structure. Take away the feedback and the system starts wandering. Take away the seed structure and it loses useful priors.
This is a useful lesson beyond this specific paper. In financial AI, open-ended generation is rarely the product. The product is generation plus measurement plus rejection. The rejection step is where most of the intellectual hygiene lives.
More factors are not automatically better, because apparently abundance still has consequences
One counterintuitive result concerns generation volume. The paper varies $M$, the number of LLM-generated candidate factors per search step. Generating fewer candidates, such as $M=5$, can produce stronger average portfolio performance but with higher variance. Increasing $M$ introduces diversity, but the paper reports deterioration in Sharpe, RankIC, and RankICIR.
This makes practical sense. LLM output quality is not independent of batch size. Ask for too much at once and the system may produce shorter, more redundant, or more trivial expressions. More candidates can also weaken evolutionary pressure: the search becomes broader but less selective early on.
For operators, this is not a cute prompt-engineering footnote. It says the factor factory needs production controls. Candidate volume, filtering thresholds, search frequency, retry rules, and diversity management are parameters of the research process. They are not clerical details. They shape the signal inventory.
The same applies to warm-up. The paper finds limited gains from longer pre-backtest warm-up phases. Historical factor mining by itself does not do much unless tightly coupled to rolling evaluation. That is another useful warning. Static factor libraries decay. Static LLM-generated factor libraries are still static factor libraries, just with better stationery.
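One way to treat these knobs as first-class process parameters is to pin them in a validated, immutable config object. The sketch below is hypothetical throughout: the field names and most defaults are assumptions. Only $M=5$, the weekly interval, and $m=10$ echo values mentioned in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchConfig:
    """Hypothetical research-process settings for an EFS-like loop."""
    candidates_per_step: int = 5     # M: LLM-generated factors per search step
    search_interval_days: int = 5    # roughly weekly on daily data (assumption)
    max_pairwise_corr: float = 0.95  # diversity threshold (assumption)
    min_rank_ic: float = 0.0         # pruning threshold (assumption)
    max_retries: int = 2             # retries after invalid LLM output (assumption)
    top_m: int = 10                  # portfolio cardinality m

    def __post_init__(self):
        # Reject settings that would make the search loop meaningless.
        if self.candidates_per_step < 1:
            raise ValueError("candidates_per_step must be at least 1")
        if self.search_interval_days < 1:
            raise ValueError("search_interval_days must be at least 1")
        if not 0.0 < self.max_pairwise_corr <= 1.0:
            raise ValueError("max_pairwise_corr must be in (0, 1]")
        if self.max_retries < 0 or self.top_m < 1:
            raise ValueError("max_retries must be >= 0 and top_m >= 1")
```

Freezing the dataclass means every backtest run can log the exact configuration it used, which is the minimum needed to compare runs honestly.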
The generated factors are interpretable, but not necessarily simple
The paper gives examples of generated factors that combine short-term price deviation, EMA-normalised momentum, Bollinger Band width, mean return, momentum, and volatility adjustment. These are not opaque neural-network embeddings. They are Python functions with readable internal components.
That is a genuine advantage. A quant researcher can inspect the expression, identify whether it is trend-following, mean-reverting, volatility-filtered, breakout-oriented, or some hybrid. Debugging is possible. Governance is easier. A risk committee might still complain, obviously, but at least it would have something concrete to complain about.
The generated factors also illustrate why LLMs are well-suited to this particular role. They are not merely selecting from a fixed menu. They can recombine known financial motifs into new formulas, while respecting code-format constraints and numerical safety rules. The paper’s prompt asks for executable NumPy functions, forbids external dependencies, requires edge-case handling, and constrains naming and window sizes.
That is the correct level of ambition for current LLMs in quant infrastructure: not “discover truth”, but “generate structured candidates inside a controlled grammar, then let empirical testing decide who survives”.
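Part of that controlled grammar can be enforced before any candidate is executed. The sketch below uses Python's standard `ast` module to reject generated source that fails to parse, imports anything other than NumPy, or defines no function. The helper name and the allow-list are assumptions; the paper's actual validation code is not shown in the source.

```python
import ast

ALLOWED_IMPORTS = {"numpy"}  # assumption: NumPy is the only permitted dependency

def violates_constraints(source: str) -> list[str]:
    """Return a list of problems found in generated factor source (empty if clean)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        problems += [f"disallowed import: {n}" for n in names if n not in ALLOWED_IMPORTS]
    # A candidate must define at least one top-level function to be callable.
    if not any(isinstance(n, ast.FunctionDef) for n in tree.body):
        problems.append("no top-level function defined")
    return problems
```

A static check like this is cheap and runs without executing untrusted code, so it fits naturally in front of any runtime validation.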
Transaction costs do not kill the story, but they do change the story
The main experiments exclude transaction costs. That is standard enough in research, but in trading it is also where many promising strategies go to discover gravity.
The appendix adds transaction-cost tests at $c=0.1\%$ and $c=0.2\%$ per reallocation step under $m=10$. Returns fall materially. For EFS-GPT on US50, CW drops from $39.746 \pm 15.484$ with no cost to $25.293 \pm 10.802$ at 0.1% and $16.114 \pm 7.464$ at 0.2%. For CSI300, EFS-DeepSeek falls from $2.451 \pm 0.412$ to $1.668 \pm 0.227$ and then $1.136 \pm 0.130$.
That is still not a collapse across the board, especially for US50. But it is a large haircut. The authors attribute the decline mainly to equal weighting and frequent portfolio shifts. When the top-ranked assets change, the strategy trades. When the strategy trades, costs arrive. They are punctual like that.
This is one of the cleanest business boundaries in the paper. EFS is promising as a signal-generation and sparse-ranking engine. A deployable version would need turnover-aware selection, holding-aware filters, transaction-cost-adjusted objectives, and probably constraints on rebalance intensity. Otherwise the strategy may spend too much of its alpha paying the market’s cover charge.
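The link between turnover and the cost haircut can be made explicit with a small simulation. This sketch assumes a proportional cost `c` charged on the L1 turnover at each rebalance, with weights drifting with returns between rebalances and the portfolio starting in cash. That cost model is a common convention and an assumption here; the paper's exact cost accounting may differ.

```python
import numpy as np

def wealth_after_costs(weights, returns, c: float) -> float:
    """Final wealth from 1.0, given target weights and simple returns of shape (T, N)."""
    weights = np.asarray(weights, dtype=float)
    returns = np.asarray(returns, dtype=float)
    wealth = 1.0
    held = np.zeros(weights.shape[1])  # start fully in cash
    for w, r in zip(weights, returns):
        # Pay a proportional cost on the amount traded to reach the target weights.
        turnover = np.abs(w - held).sum()
        wealth *= 1.0 - c * turnover
        # Apply the period's portfolio return.
        growth = 1.0 + float(w @ r)
        wealth *= growth
        # Weights drift with asset returns before the next rebalance.
        held = w * (1.0 + r) / growth if growth > 0 else np.zeros_like(w)
    return wealth
```

Under this model a full switch from one asset to another costs `2c`, twice the cost of the initial purchase, which is why frequent changes in the top-ranked set are so expensive.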
Score-based weighting is powerful, and therefore dangerous
The paper also tests converting factor scores into portfolio weights rather than using equal weights. This can produce much larger cumulative wealth. In one reported table, under score-to-weight allocation with different temperature settings, US50 cumulative wealth reaches very high levels — for example, DeepSeek under $m=10$ reaches CW of 389.010 at temperature 2.0, while GPT-4.1 reaches 332.340.
This looks fantastic until one reads the explanation. The authors note that exceptional final wealth is partly driven by allocating nearly all capital to a few assets that experienced extreme upward moves. That can exploit rare opportunities. It can also produce substantial tail risk.
This is not a flaw in the paper; it is a useful honesty test. Score weighting turns factor confidence into capital concentration. If the confidence is right, the backtest smiles. If it is wrong, the portfolio becomes a cautionary tale with a Bloomberg terminal.
The authors recommend equal-weighted or temperature-smoothed versions for practical use. That is the conservative conclusion, and it is probably the correct one. In production, the score-to-weight variant should be treated as an exploratory extension, not the default operating mode.
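The concentration effect is easy to see in a softmax mapping from scores to weights. The sketch below assumes the convention `softmax(scores / temperature)`, in which a higher temperature flattens the weights toward equal weighting. This is one common convention and an assumption here; the paper's temperature parameter may be defined differently.

```python
import numpy as np

def scores_to_weights(scores, temperature: float = 1.0) -> np.ndarray:
    """Softmax over selected-asset scores; higher temperature gives flatter weights."""
    s = np.asarray(scores, dtype=float) / temperature
    s -= s.max()  # subtract the max for numerical stability
    e = np.exp(s)
    return e / e.sum()
```

With scores `[1, 2, 3]`, a temperature of 0.1 puts almost all capital in the top asset, while a temperature of 100 is close to equal weighting. The ranking is identical in both cases; only the capital at risk behind it changes.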
What Cognaptus infers for business use
The business value of EFS is not that it replaces a quant team. It changes what the team can iterate.
| Paper directly shows | Cognaptus inference for operators | What remains uncertain |
|---|---|---|
| LLMs can generate executable alpha-factor formulas through constrained prompts | Quant teams can use LLMs to expand the candidate factor search space faster | Whether generated factors survive live deployment |
| Performance feedback is critical in ablations | The evaluation loop is more valuable than one-shot generation | How to design feedback under changing cost, risk, and liquidity constraints |
| EFS performs strongly on Fama-French and real-market datasets | Sparse portfolio research can benefit from ranking-oriented factor evolution | Whether the result generalises to other universes, asset classes, and intraday settings |
| Transaction costs reduce returns materially | Turnover control must be built into the objective, not patched on later | The optimal turnover-aware version is not yet established |
| Score weighting can amplify returns but creates concentration | Factor scores are useful for ranking; aggressive capital allocation needs separate risk governance | How to calibrate confidence without overfitting |
For asset managers, the most immediate application is research acceleration. EFS-like systems could generate factor variants, maintain candidate libraries, perform rolling evaluations, and surface interpretable signals for human review. That reduces the bottleneck in manual factor ideation.
For fintech platforms, the opportunity is different. EFS suggests a way to build explainable strategy-generation modules where users can inspect signal logic rather than receive mysterious model scores. That matters for trust, compliance, and customer education. It is much easier to explain “this factor favours low-volatility breakouts” than “the neural network liked it”.
For institutional portfolio construction, the sparse-ranking framing may be especially relevant. Many real portfolios are not unconstrained mean-variance toys. They are cardinality-constrained, compliance-constrained, turnover-constrained, liquidity-constrained collections of compromises. A system that directly works with sparse selection is therefore closer to the operating problem.
But the paper’s evidence is still backtesting evidence. It supports further development. It does not support firing the investment committee and replacing it with a prompt template, tempting though that may be on certain Mondays.
Boundaries before anyone wires this into production
There are five practical boundaries.
First, transaction costs and turnover are not peripheral. The appendix shows that costs materially reduce cumulative wealth. A production system should optimise net returns, not just gross backtest performance.
Second, LLM output variability matters. The authors mitigate this through repeated runs and aggregation. A real deployment would need stronger versioning, logging, retry policies, validation gates, and drift monitoring.
Third, the paper’s LLMs receive anonymised performance feedback and do not receive raw tickers, dates, or price series. That is a thoughtful data-leakage control. Still, live deployment would require stricter audit trails, especially if the model provider, prompt contents, or external data access changes.
Fourth, the factor universe is price-based. The experiments use closing prices and returns, with generated factors constrained around those inputs. That keeps the system clean and interpretable, but it also limits the information set. Future versions with fundamentals, news, filings, options data, or macro variables would be a different system with different leakage and overfitting risks.
Fifth, the evaluation horizon is still historical. The method adapts within backtests, but markets adapt too. Once a strategy becomes widely used, its edge may compress. Alpha, unlike software margins, tends to be rude about scale.
The real lesson: build better research loops, not louder AI claims
EFS is valuable because it treats the LLM as part of a disciplined search process. It does not ask the model for a market prophecy. It asks for candidate factor code, tests that code, keeps what works, discards what fails, and repeats.
That is the architecture worth paying attention to.
The paper’s strongest contribution is not merely that GPT-4.1 and DeepSeek-backed variants perform well in backtests. It is the demonstration that language models can participate in an evolutionary research loop where financial ideas are generated as executable, interpretable formulas and judged by sparse portfolio outcomes.
That is a practical direction for AI in finance. Less oracle. More factory. Less “the model knows”. More “the model proposes, the backtest disposes”.
A wonderfully unromantic arrangement. Which, in finance, is usually a feature.
Cognaptus: Automate the Present, Incubate the Future.
[1] Haochen Luo, Yuan Zhang, and Chen Liu, “EFS: Evolutionary Factor Searching for Sparse Portfolio Optimization Using Large Language Models,” arXiv:2507.17211, 2025. https://arxiv.org/pdf/2507.17211