From One Shot to Many: Why AI Should Stop Guessing and Start Exploring

One answer is tidy. One answer is easy to grade. One answer also happens to be a strangely fragile way to use AI.

That is not just a philosophical complaint about creativity, brainstorming, or whether a chatbot sounds confident enough while being quietly wrong. It becomes a technical problem when AI systems generate artifacts that other systems must consume: code, formal specifications, compliance rules, database transformations, contracts, workflows, or mathematical statements. In those settings, the generated object is not merely a sentence. It is an interface.

An interface can be valid and still be inconvenient. It can compile and still be awkward to prove. It can preserve the meaning of an informal request and still send the downstream solver into a swamp. Very professional. Very modern. Very avoidable.

This is the useful idea behind the arXiv paper FormalEvolve: Neuro-Symbolic Evolutionary Search for Diverse and Prover-Effective Autoformalization.¹ The paper studies Lean 4 autoformalization: translating informal mathematical claims into machine-checkable Lean statements. But the larger lesson travels beyond theorem proving. FormalEvolve argues that autoformalization should not be treated as a single-output translation problem. It should be treated as budgeted search over a repertoire of candidate formalizations.

That sounds like a small reframing. It is not. It changes the unit of evaluation from “Did the model produce one acceptable statement?” to “Did the system build a useful set of verified alternatives under a fixed budget?”

For business AI, that distinction matters because many deployment failures are not caused by the model being completely wrong. They are caused by the model producing something that passes the first check but fails the workflow that actually matters.

The real failure mode is not invalid output; it is valid-but-unhelpful output

Autoformalization has an obvious first hurdle: the output must compile. In Lean, compilation means the statement passes elaboration and type checking under a specified Lean/Mathlib environment. If it does not compile, it cannot be used for formal proving. This is the clean part.

The second hurdle is semantic faithfulness. A formal statement may compile while saying the wrong thing. A compiler does not know whether the Lean statement faithfully captures the informal theorem. It only knows whether the Lean object is well-formed. As anyone who has ever received syntactically valid nonsense from an enterprise automation system can confirm, “well-formed” is not exactly the same thing as “useful.”

FormalEvolve adds a third hurdle that is easy to miss: prover-facing usefulness. Even if two Lean statements compile and are judged semantically faithful, they may behave differently under the same automated prover and the same proof-attempt budget. The paper calls this the gap between semantic consistency and prover equivalence.

This is the central misconception the article should remove: a compilable, semantically faithful formal statement is not automatically equivalent to every other faithful statement for downstream proof search.

A formal statement is therefore not only an endpoint of translation. It is also a target surface for another machine. Slight changes in binder structure, theorem shape, imports, type annotations, or statement decomposition may make the proof search easier or harder. In human terms: the same idea can be phrased in a way that makes the next person immediately understand it, or in a way that makes them stare at the ceiling for forty minutes. Machines have their own version of that ceiling.

The paper’s mechanism starts from this observation. If one faithful statement can be prover-hostile, the system should not bet the whole budget on one statement. It should build a repertoire.

FormalEvolve turns autoformalization into repertoire search

FormalEvolve is a test-time search framework for Lean 4 statement generation. It starts from candidate formal statements, stores compilable ones in an archive, mutates and recombines them, repairs failures, and reports a deduplicated set of candidates that both compile and pass a semantic judge.

The simplest picture is this:

Informal theorem
      ↓
Seed candidate Lean statements
      ↓
Compilation-gated archive
      ↓
LLM mutation / diff patch / cross patch
      ↓
Bounded compile repair and semantic repair
      ↓
Optional conservative AST rewrites when patching stalls
      ↓
Deduplicated semantic repertoire
      ↓
Fixed-budget downstream prover attempts

The important design choice is that FormalEvolve separates the archive from the final reported repertoire.

The archive stores unique candidates that compile. Some of them may fail the semantic judge. They are still useful as stepping stones because they contain workable Lean structure: imports, binders, declarations, or proof-facing shapes that future edits can reuse. The reported repertoire is stricter: it includes only deduplicated candidates that both compile and pass the semantic-faithfulness judge.

That separation is more practical than it first appears. If the archive stored only perfect candidates, search would become too brittle. If the final output exposed every compilable candidate, semantic drift would pollute evaluation. FormalEvolve keeps the messy workshop separate from the showroom. A radical idea, apparently.

The mechanism is an archive, not just more sampling

A natural objection is that FormalEvolve might simply be “sample more candidates.” The paper spends a considerable amount of experimental design trying to avoid that lazy interpretation.

The baselines include memoryless, no-archive methods under the same generator-call budget of $T = 100$ per problem. These baselines perform fresh sampling and repair without maintaining reusable archive state. The key matched comparison is the Hybrid control: it uses the same Kimina-generation and Qwen3-repair stack as FormalEvolve, but disables archive state, parent reuse, archive context, usage-penalized selection, and island migration.

That matters because the paper’s claim is not merely that a stronger repair model helps. It clearly does. The question is whether archive-based search adds something on top of stronger repair.

FormalEvolve’s archive mechanism has several moving parts:

Mechanism	What it does	Why it matters
Compilation gate	Only compilable candidates enter the archive	Keeps the search inside Lean’s feasible region
Semantic judge	Filters candidates for reported semantic success	Prevents compile-only nonsense from becoming “success”
Mutation and patching	Rewrites full statements, local diffs, or combinations with archive inspirations	Explores nearby and recombined formalizations
Bounded repair	Uses compiler or judge feedback, with repair calls debited to the same budget	Prevents one hard case from eating unlimited compute
Usage-penalized selection	Reduces repeated reuse of the same parent candidate	Avoids collapsing onto one easy template
EvolAST fallback	Applies conservative generator-call-free AST rewrites after duplicates or compile failures	Adds structural variation, but only within a controlled rewrite surface

The usage penalty is especially interesting. Candidates judged semantically consistent receive better selection weight, but repeatedly selected parents are discounted. In plain terms, the system prefers promising candidates without letting one candidate become the office superstar who attends every meeting and somehow blocks all actual work.

The EvolAST component is more delicate. It performs conservative symbolic rewrites over theorem structure, such as reordering hypotheses or applying symmetry-like transformations. It is generator-call-free, but not cost-free: the resulting candidates still need compilation and semantic judging. The appendix makes this distinction explicit by reporting judge-call overhead separately.

This budget accounting is not a minor implementation detail. It is part of the paper’s intellectual discipline. When a system claims to improve generation under a fixed budget, it must say which costs are inside the budget and which are evaluator-side costs. Otherwise “test-time search” becomes a polite way of saying “we spent more money and hoped nobody asked where.”

The main evidence: better coverage, less concentration, modest proof gains

FormalEvolve is evaluated on ProofNet, using the Lean 4 port test split with 186 problems, and CombiBench, a 100-problem benchmark with a domain and style shift. The statement-generation budget is $T = 100$ generator calls per problem. The downstream proving evaluation uses Goedel-Prover-V2-32B with a fixed $B = 64$ prover-attempt budget.

The headline statement-generation result is straightforward. Against the matched Hybrid no-archive control, FormalEvolve improves SH@100, the fraction of problems with at least one compilable and judge-consistent candidate:

Benchmark	Hybrid no-archive SH@100	FormalEvolve SH@100	Difference
CombiBench	0.530	0.580	+0.050
ProofNet	0.828	0.849	+0.021

These are not magic jumps. They are controlled, fixed-budget gains. That is actually better for interpretation. The paper is not claiming that archive search turns theorem proving into solved infrastructure. It is showing that when the comparison is budget-matched and repair-model strength is controlled, archive search still adds measurable coverage.

Coverage alone, however, can be misleading. A method could improve aggregate semantic-success counts by generating many variants for already-easy problems. FormalEvolve therefore reports concentration metrics: Gini and top-10% share over per-problem semantic-success counts. Lower concentration means successes are less dominated by a small group of easy problems.

Against the Hybrid control, FormalEvolve reduces Gini from 0.790 to 0.759 on CombiBench and from 0.505 to 0.443 on ProofNet. Top-10% share also falls, from 0.588 to 0.531 on CombiBench and from 0.241 to 0.229 on ProofNet.

This is one of the paper’s more business-relevant measurements. It says the system is not merely polishing the same easy cases. It is spreading semantic successes more evenly across the problem set. In enterprise terms, that is the difference between a demo that works on five friendly examples and a workflow that starts to cover the annoying tail.

The proof result is useful, but it should be read carefully

The downstream proof evaluation is where the “repertoire” thesis becomes concrete. For each problem, the prover receives up to 64 deduplicated, judge-accepted candidate statements from the repertoire. Empty repertoires count as failures. The paper reports pass@64, complete@64, and theorem-complete@64.

Theorem-complete@64 is the strictest and most relevant metric here. It requires a complete proof of an explicit theorem or lemma, not just a Lean-accepted file with auxiliary definitions or incomplete proof content. This is particularly important on CombiBench, where the appendix notes that 9–13% of prompts can be abbrev-only, making theorem-complete a stricter success signal.

The downstream result is modest but directionally aligned with the mechanism:

Benchmark	Hybrid theorem-complete@64	FormalEvolve theorem-complete@64	Difference
CombiBench	9/100	13/100	+4 problems
ProofNet	40/186	45/186	+5 problems

This supports the hedging argument: a diverse accepted repertoire increases the chance that at least one candidate is friendly to the prover under a fixed attempt budget.

But there is a boundary. FormalEvolve does not dominate every baseline in every downstream view. On ProofNet, Kimina Compile+Semantic Repair reports theorem-complete@64 of 46/186, slightly above FormalEvolve’s 45/186. The paper is honest enough to include non-dominance cases: sometimes the baseline finds a prover-friendly statement and FormalEvolve does not. This is not a bug in the argument; it is the argument. Different statement variants expose different proof-search behavior.

The practical interpretation is therefore not “archive search always wins.” It is “single-output success is the wrong unit, and archive search can improve fixed-budget odds while still requiring downstream validation.” Less catchy. More useful.

The ablations show repair is the workhorse, while diversity is sensitive

The ablation results are where a normal summary can easily become misleading. FormalEvolve contains multiple components, but they do not contribute equally.

Removing patch repair drops SH@100 from 0.580 to 0.470 on CombiBench and from 0.849 to 0.780 on ProofNet. The paper explicitly identifies bounded repair as the largest direct driver of semantic hit rate. That means the archive should not be romanticized as a mystical evolutionary engine. Much of the practical gain comes from repairing candidates using compiler and semantic feedback, then making repaired candidates reusable as future parents and prompt context.

The island and EvolAST effects are smaller and more dataset-dependent. The K=1 archive variant, which removes island separation while keeping archive-based parent reuse and usage-penalized selection, actually improves ProofNet coverage and concentration and improves CombiBench concentration, although the default K=2 setting gives the strongest CombiBench SH@100 among FormalEvolve variants.

EvolAST is even more cautionary. On CombiBench, the default K=2 with EvolAST improves SH@100 over the no-EvolAST ablation, 0.580 versus 0.500. On ProofNet, however, the no-EvolAST variant has higher SH@100, 0.871 versus 0.849. The appendix explains why this can happen: a conservative AST rewrite can preserve the wrong semantic skeleton. If the current candidate is structurally feasible but semantically misaligned, a symbolic rewrite may produce many beautifully compilable variations of the same mistake. Very elegant. Still wrong.

This is why the paper’s mechanism-first reading matters. FormalEvolve is not “diversity good.” It is “verified, budgeted, feedback-guided diversity can improve search, but diversity operators must be audited because they alter the search trajectory.”

Experimental element	Likely purpose	What it supports	What it does not prove
Hybrid no-archive control	Main controlled comparison	Archive search adds gains beyond stronger repair-model use	Archive search is universally superior
No patch repair	Ablation	Bounded repair is the largest direct SH@100 driver	Diversity alone explains the gains
K=1 vs K=2	Implementation sensitivity	Island separation is not the whole story	Two islands is generally optimal
No EvolAST	Robustness / component sensitivity	Symbolic fallback can help or hurt depending on dataset and trajectory	AST rewrites are always beneficial
Stronger-base CombiBench test	Robustness/sensitivity check	Archive pattern persists with stronger seed/repair stack	Same magnitude transfers to all systems
Manual audits	Calibration of LLM-judge positives	Archive search does not visibly amplify semantic drift in the audited slice	Every judge-positive output is faithful

The strongest-base check is a useful robustness signal, not a second thesis

The paper includes a stronger-base statement-generation evaluation on CombiBench using Goedel-Formalizer-V2-8B for seeding and GPT-5.4 for repair, with the same $T = 100$ accounting and semantic judge. This is not the main experiment. It is a robustness or sensitivity check: does archive search still help when the base generator and repair stack are stronger?

The answer is yes, in the statement-stage metrics. FormalEvolve raises SH@100 from 0.86 to 0.88 compared with the matched no-archive Hybrid control, increases deduplicated SemOKtotal from 1,886 to 3,238, and reduces concentration: Gini falls from 0.4832 to 0.4383, while top-10% share falls from 0.3128 to 0.2202.

The magnitude should not be oversold. The SH@100 gain is small. The more interesting result is that the archive produces a much larger and less concentrated pool of judge-accepted candidates. That supports the paper’s system-level claim: archive search is not merely compensating for a weak base model. It can still change the shape of the candidate distribution when the base stack improves.

For business readers, this is the right analogy. As foundation models improve, the value of orchestration does not disappear. It shifts. The orchestration layer becomes less about rescuing obviously bad outputs and more about shaping verified alternatives for downstream tools.

What this means for enterprise AI systems

FormalEvolve is about Lean 4 mathematics, not invoice processing, contract automation, SQL generation, or compliance review. Still, the pattern is portable because the underlying workflow problem is common: an AI system generates an artifact, a hard checker can reject some failures, and a downstream tool or human process determines whether the artifact is actually useful.

The business inference is this:

When many valid artifacts can express the same intent, the production system should often maintain a verified candidate portfolio instead of forcing the model to nominate a single winner too early.

This applies most naturally when three conditions hold.

First, there must be a hard or semi-hard verifier. In Lean, that verifier is compilation. In software engineering, it may be a compiler, test suite, type checker, linter, security scanner, or runtime sandbox. In data work, it may be schema validation, reconciliation checks, database constraints, or deterministic unit tests. In compliance workflows, it may be rule coverage, required-clause detection, or structured policy validation.

Second, there must be meaningful variation among valid artifacts. If all valid outputs behave identically downstream, a repertoire is waste. But many enterprise artifacts are not like that. Two SQL queries can return the same intended result while differing in performance and maintainability. Two contract clauses can satisfy a policy while differing in negotiation risk. Two workflow plans can complete the same task while creating very different failure surfaces.

Third, downstream performance must be sensitive to representation. This is the core bridge from theorem proving to business automation. FormalEvolve’s theorem statements are prover-facing interfaces. Enterprise outputs are often tool-facing, auditor-facing, user-facing, or execution-facing interfaces. Representation is not decoration. It changes what happens next.

A practical enterprise version of the FormalEvolve pattern might look like this:

FormalEvolve component	Enterprise AI analogue	Operational purpose
Informal theorem	Business request or natural-language policy	Defines intent
Lean compilation	Hard validation: compile, test, schema, policy, sandbox	Rejects infeasible artifacts
Semantic judge	Business-faithfulness evaluator or human review rubric	Checks whether the artifact still matches intent
Archive	Candidate repository with metadata and outcomes	Reuses good structures instead of starting over
Mutation / crossover	Prompted edits and recombinations from prior candidates	Explores useful alternatives
Bounded repair	Tool-feedback repair with fixed budget	Prevents runaway correction loops
Downstream prover	Actual execution environment, reviewer, customer workflow	Measures whether the artifact is useful in practice

The ROI argument is not “generate more.” That is how compute bills become performance art. The ROI argument is “spend test-time compute where representation affects downstream success, and preserve validated candidates so each repair improves the search space rather than disappearing after one attempt.”

The business value is tail coverage and diagnosis, not just better averages

The most business-relevant metric in the paper may not be SH@100 itself. It may be concentration.

Enterprise AI often fails unevenly. It performs well on common, clean, easy cases and then collapses on edge cases that account for disproportionate human escalation. If a new method improves the easy cases again, the average dashboard looks pleasant while the operations team remains spiritually damaged.

FormalEvolve’s concentration results are therefore important. The method reduces the degree to which semantic successes are concentrated among the easiest problems. The appendix further shows that gains often appear in the baseline-ranked tail, although non-dominance remains: there are cases where the baseline succeeds and FormalEvolve misses.

That pattern is exactly what a useful production diagnostic should reveal. A candidate-repertoire system should not only report success rate. It should show where it expands coverage, where it duplicates effort, where repair calls are being spent, which candidates pass hard validation but fail semantic review, and which accepted candidates fail downstream execution.

This is where many enterprise AI deployments remain under-instrumented. They log the final answer and maybe a thumbs-up. FormalEvolve logs the search process: generator calls, repair composition, semantic hits, concentration, downstream proof success, and manual audit slices. The lesson is not that every business system needs a theorem-proving-grade evaluation suite. The lesson is that serious automation needs process telemetry, not just output telemetry.

Boundaries: what the paper directly shows, and what it does not

The paper directly shows that, in Lean 4 autoformalization under its fixed evaluation protocol, archive-based test-time search improves semantic hit coverage over matched no-archive controls, reduces concentration of semantic successes, and produces theorem-complete gains over the matched Hybrid control under a fixed prover budget.

It also shows that the gains are not free of qualifications.

Semantic labels rely on an LLM judge, CriticLean-Qwen3-14B. The authors therefore include manual audits. On a matched 50-problem ProofNet statement-stage audit, FormalEvolve, Hybrid control, and Sample have similar Faithful / Partial / Severe totals: 42/5/3, 43/4/3, and 43/3/4. That suggests archive search does not visibly amplify judge-positive semantic drift in this audited slice. It does not prove that every accepted candidate is faithful.

At the theorem-complete end, a 161-artifact audit across the original three generator-based methods finds 111 faithful, 17 partial, and 33 severe artifacts. FormalEvolve has the largest absolute faithful count, 41, while severe counts are numerically similar across methods. This is the correct boundary: proof success certifies the encoded formal statement, not necessarily its faithfulness to the source claim.

The downstream prover result is also protocol-dependent. The paper uses one fixed prover, Goedel-Prover-V2-32B, one fixed prompt template, and $B = 64$ prover attempts. Another prover, another proof-search policy, or another candidate ordering could change the result. That does not invalidate the finding. It simply means the finding is about a measured interface between generated statements and a particular downstream prover protocol.

For business translation, the uncertainty boundary is even wider. The paper is not evidence that every code agent, contract generator, or compliance assistant should immediately become an evolutionary archive system. It is evidence for a design pattern: when hard verification exists, semantic faithfulness is imperfect, and downstream utility is representation-sensitive, one-shot generation is an unnecessarily narrow operating mode.

The practical takeaway: stop asking AI for “the answer” too early

The most useful lesson from FormalEvolve is not that theorem proving needs more candidates. The deeper lesson is that AI systems often need a better definition of output.

For simple tasks, one answer is fine. For formal artifacts, one answer can be a premature commitment. The system should instead ask: What is the feasible set? Which candidates preserve intent? Which variants perform best downstream? Where do failures cluster? Which repairs create reusable structure?

That is a different architecture from a chatbot-style pipeline. It is closer to a search-and-validation loop:

Generate multiple candidates under a strict budget.
Use hard verifiers to filter infeasible artifacts.
Store validated candidates and near-useful structures.
Repair with bounded tool feedback.
Preserve diversity without letting diversity become decorative chaos.
Test candidates against the downstream tool or workflow.
Audit semantic faithfulness separately from execution success.

FormalEvolve’s contribution is to make that architecture concrete in Lean autoformalization and to measure it carefully enough that the result is interpretable. It does not abolish the need for judgment. It gives judgment better objects to inspect.

The old one-shot framing asks the model to guess the right formalization. FormalEvolve asks the system to explore a verified space of alternatives and let downstream evidence decide what works. That is a healthier relationship with AI: less prophecy, more engineering.

For business automation, this may be the more durable direction. The future is not just bigger models producing bigger single answers. It is systems that generate, verify, repair, remember, diversify, and test. Less oracle. More workshop.

And yes, the workshop is messier. That is usually where the useful things get built.

Cognaptus: Automate the Present, Incubate the Future.

Haijian Lu, Wei Wang, and Jing Liu, “FormalEvolve: Neuro-Symbolic Evolutionary Search for Diverse and Prover-Effective Autoformalization,” arXiv:2603.19828v3, 2026. ↩︎

From One Shot to Many: Why AI Should Stop Guessing and Start Exploring#

The real failure mode is not invalid output; it is valid-but-unhelpful output#

FormalEvolve turns autoformalization into repertoire search#

The mechanism is an archive, not just more sampling#

The main evidence: better coverage, less concentration, modest proof gains#

The proof result is useful, but it should be read carefully#

The ablations show repair is the workhorse, while diversity is sensitive#

The strongest-base check is a useful robustness signal, not a second thesis#

What this means for enterprise AI systems#

The business value is tail coverage and diagnosis, not just better averages#

Boundaries: what the paper directly shows, and what it does not#

The practical takeaway: stop asking AI for “the answer” too early#