Budget is where many impressive AI demos go to become ordinary software.

A model can reason longer. It can sample more. It can revise itself, compare candidates, aggregate outputs, and repeat the whole ritual until the invoice starts looking like a small infrastructure project. The obvious response is to ask whether the strongest model should simply do all of this work. Obvious, yes. Economically elegant, not quite.

The useful idea in Squeeze Evolve is not that AI systems should “think more.” We have had that slogan for a while, and it has become the enterprise version of telling a tired analyst to “be more strategic.” The paper’s sharper point is that test-time intelligence has stages, and those stages do not all deserve the same model.[1]

That turns the problem from model selection into allocation design.

Instead of asking, “Which model should solve this task?”, Squeeze Evolve asks a more operational question: where in the reasoning pipeline does expensive intelligence have the highest marginal value? The answer is not “everywhere.” In the paper’s experiments, strong models are most valuable for producing the initial candidate population and for handling uncertain recombination groups. Cheaper models can often aggregate already-good candidates. If candidates already agree, the system may not need an LLM at all. A shocking discovery, apparently: sometimes the cheapest inference call is the one you do not make.

The mistake is treating test-time scaling as one-model endurance

Test-time scaling is usually described as spending more computation during inference to improve output quality. In practice, this can mean majority voting, self-refinement, recursive aggregation, tree search, longer reasoning traces, or evolutionary search. These techniques differ on the surface, but they share the same basic intuition: a model’s first answer is not always the best answer, so give the system more chances to search.

Squeeze Evolve begins by placing these methods into a common evolutionary frame. A system starts with a population of candidate solutions. It selects promising candidates. It recombines them into new candidates. Then it repeats.

In that frame, many familiar methods become special cases:

| Method | Evolutionary interpretation | Fitness signal | Usual model assignment |
|---|---|---|---|
| Majority voting | Generate once, select the largest answer cluster | Consensus frequency | Single model |
| Self-refinement | One candidate critiques and rewrites itself | Natural-language critique | Single model |
| Recursive self-aggregation | Groups of candidates are repeatedly aggregated | Implicit model judgment | Single model |
| Verifier-based evolution | Candidates are scored by an external evaluator | External reward or execution result | Often one main model plus verifier |
| Squeeze Evolve | Candidate groups are routed across model tiers | Confidence or diversity proxy | Multi-model, adaptive |
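The common frame can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `evolve`, `majority_vote`, and the `toy_*` functions are hypothetical names, and the toy operators are deterministic stand-ins for what would be model calls in a real system.

```python
from collections import Counter
from typing import Callable, List

Candidate = str


def evolve(
    init: Callable[[int], List[Candidate]],
    select: Callable[[List[Candidate]], List[List[Candidate]]],
    recombine: Callable[[List[Candidate]], Candidate],
    population_size: int,
    loops: int,
) -> List[Candidate]:
    """Generic initialize -> select -> recombine -> repeat loop."""
    population = init(population_size)
    for _ in range(loops):
        groups = select(population)
        population = [recombine(group) for group in groups]
    return population


def majority_vote(population: List[Candidate]) -> Candidate:
    """Majority voting as a special case: no loops, pick the largest cluster."""
    return Counter(population).most_common(1)[0][0]


# Toy stand-ins for model calls (assumptions for illustration only).
def toy_init(n: int) -> List[Candidate]:
    return ["42", "41", "42", "42", "17", "42"][:n]


def toy_select(population: List[Candidate]) -> List[List[Candidate]]:
    # One group of three per slot, taken as a sliding window that wraps around.
    n = len(population)
    return [[population[(i + j) % n] for j in range(3)] for i in range(n)]


def toy_recombine(group: List[Candidate]) -> Candidate:
    # A real system would ask a model to synthesize; here we take the mode.
    return Counter(group).most_common(1)[0][0]


final_population = evolve(toy_init, toy_select, toy_recombine, 6, 1)
```

In this toy run a single loop already turns six candidates with three distinct answers into six copies of one answer. That is harmless when the dominant answer is right, and it is exactly the narrowing behavior that becomes a problem when it is not.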

This unification matters because it exposes a design flaw that is easy to miss when every method is discussed under its own branding. The bottleneck is not only whether the system searches. The bottleneck is which model performs which evolutionary operation.

If a single expensive model initializes candidates, scores uncertainty, recombines groups, and repeats the process, the pipeline may be capable but wasteful. If a single cheap model does all the work, the pipeline may be cheap but structurally limited. Squeeze Evolve argues for a middle position: spend premium intelligence where it changes the distribution of possible solutions, and use cheaper computation where the task has already become mostly aggregation.

That is the mechanism. The benchmarks matter, but they make sense only after this mechanism is clear.

Verifier-free evolution has a diversity problem before it has a cost problem

The paper focuses on verifier-free evolution. That means the system does not have a reliable external judge that can score candidates against ground truth during the search process. No execution oracle. No reward model trained for the task. No cheap correctness checker standing nearby with a clipboard.

This is attractive because external verification can be unavailable, slow, or expensive. But it creates a problem: if the model must judge its own candidates, it can only amplify what it already tends to recognize. Over repeated loops, the candidate population can collapse toward a narrower set of answers.

The paper’s Figure 2 is a motivation experiment, not merely decoration. Its purpose is to show that single-model open-loop evolution can lose semantic diversity and reduce the population’s pass@k ceiling. In plainer language: even if the final selected answer looks polished, the pool of possible correct answers may have become smaller. The system is not exploring better. It is becoming more consistent. These are not the same thing, though software dashboards have been known to confuse them.

The authors compare single-model baselines with Squeeze Evolve on GPQA-Diamond-style settings and observe that single-model evolution loses diversity after early loops, with a corresponding decline in pass@k. Multi-model orchestration remains higher and flatter on both diversity and pass@k. The interpretation is not that diversity is a decorative nice-to-have. Diversity is the search ceiling. Once the right candidate disappears from the population, no amount of elegant aggregation can reliably recover it.
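A worked number makes the ceiling argument concrete. The sketch below uses the standard unbiased pass@k estimator from the code-generation evaluation literature; whether the paper computes its metric in exactly this way is an assumption, and the population sizes are invented for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k candidates, drawn without
    replacement from a population of n, is correct when c of the n are."""
    if n - c < k:
        return 1.0  # Too few incorrect candidates to fill all k draws.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical population of 8 candidates, sampled 4 at a time.
early = pass_at_k(n=8, c=2, k=4)      # two correct candidates survive
collapsed = pass_at_k(n=8, c=0, k=4)  # the correct answer has left the pool
```

With two correct candidates out of eight, the estimate is 1 - 15/70, roughly 0.79. Once the correct candidates drop out of the population it is exactly zero, however confident the remaining candidates look.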

This is the first mechanism behind the paper’s business relevance. Many AI workflow systems fail not because the final model is weak, but because the pipeline narrows too early. They let one model’s priors dominate the entire process. After that, the system can spend more tokens producing a more fluent version of a mistake. Finance teams, at least, will appreciate the honesty of paying extra for wrongness.

Initialization is where expensive intelligence earns its rent

Squeeze Evolve’s first design choice is simple: initialize the candidate population with the stronger model.

The paper’s Table 2 is best read as an ablation-style test of role assignment. It asks whether the stronger model should be used at the beginning or later in recombination. The answer is not symmetric.

| Model-pair setting | Strong initialization + weak aggregation | Weak initialization + strong aggregation | Practical interpretation |
|---|---|---|---|
| GPT-OSS-120B / GPT-OSS-20B on HMMT 2025 | 89% | 85% | Strong initialization helps, though the gap is moderate |
| Qwen3-4B-Thinking / Qwen3-4B-Instruct on AIME 2025 | 88% | 65% | Bad starting populations are much harder to rescue |

The second row is the important one. A 23-point gap is not a small tuning detail. It says that the initial candidate distribution can dominate the final result. If the system begins with weak or poorly grounded candidates, later recombination may only reorganize insufficient material.

This result should make product teams slightly less obsessed with “the final answer model.” In many workflows, the expensive model may be better used upstream: generating the candidate set, creating alternative plans, grounding multimodal observations, or producing the first structured decomposition. Once the system has a strong candidate population, downstream work can often be cheaper.

That is the first allocation rule:

Use the strongest model to create the search space, not merely to polish the last answer.

This is especially relevant for enterprise workflows where the expensive part is not a single response but repeated processing at scale: document extraction, claim checking, compliance screening, code repair, market analysis, and multi-step research agents. The cost question is not “Can GPT-whatever do it?” The question is whether the system is paying frontier-model prices at stages where frontier capability no longer changes the result.

Cheap recombination works only when the candidate set is already strong

The next mechanism is less glamorous but more useful. Squeeze Evolve shows that weaker models can be effective aggregators when the candidate set contains good candidates.

Figure 3(a) and Appendix F serve as aggregation-sensitivity tests. They vary the number of correct trajectories inside a candidate group and observe how aggregation accuracy changes. The pattern is intuitive but important: if no candidate in the group is correct, neither model can reliably recover a correct answer. If all four candidates are correct, both models approach near-perfect aggregation. The large model has an advantage in intermediate cases, but the extremes reveal the routing opportunity.

The result is not “cheap models are as good as expensive models.” That would be the kind of comforting nonsense that makes procurement meetings longer. The actual result is narrower and more useful:

Cheap models can be good enough when the group already contains strong, consistent candidates.

That leads to the second allocation rule:

Route easy recombination groups to cheaper models; reserve strong models for uncertain or conflicting groups.

The remaining question is how the system knows which groups are easy or hard without a verifier. Squeeze Evolve uses two lightweight proxies.

First, group confidence uses token-level log-probabilities. If the model’s predictive distribution is peaked, confidence is higher; if it is flatter, confidence is lower. The paper treats low confidence as a signal that a group may need stronger recombination. When the scoring model is the same as the generating model, this signal is essentially free because the log-probabilities are already produced during generation. Cross-model confidence requires an additional prefill-only pass, but not full decoding.

Second, answer diversity counts distinct final answers in a group. This is useful when APIs do not expose log-probabilities, as in the paper’s ARC-AGI-V2 experiment. High diversity means disagreement; disagreement often means the group is harder.
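Both proxies are cheap to compute once the raw material is available. The sketch below is one plausible reading, not the paper's exact definition: it assumes group confidence is the mean token log-probability averaged over the trajectories in a group, and that answers are compared after trivial normalization. The function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def trajectory_confidence(token_logprobs: Sequence[float]) -> float:
    """Mean token log-probability; values closer to 0 mean a more peaked
    predictive distribution (assumed aggregation, not the paper's formula)."""
    return mean(token_logprobs)


def group_confidence(group_logprobs: Sequence[Sequence[float]]) -> float:
    """Average of the per-trajectory confidences in one candidate group."""
    return mean(trajectory_confidence(lp) for lp in group_logprobs)


def answer_diversity(final_answers: Sequence[str]) -> int:
    """Number of distinct final answers after light normalization."""
    return len({answer.strip().lower() for answer in final_answers})


# Invented log-probabilities for two groups of two trajectories each.
peaked_group = [[-0.1, -0.2, -0.1], [-0.05, -0.15, -0.1]]
flat_group = [[-1.2, -0.9, -1.5], [-2.0, -1.0, -1.5]]

peaked_score = group_confidence(peaked_group)
flat_score = group_confidence(flat_group)
n_distinct = answer_diversity(["42", " 42", "41"])
```

Here the peaked group scores about -0.12 and the flat group -1.35, so the flat group would be the candidate for escalation. The diversity count needs only extracted answers, which is why it remains usable when an API hides log-probabilities.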

The paper’s Figure 3(b), Appendix G, and related confidence analyses should be read as routing-signal validation. They do not prove that confidence is a perfect correctness estimator. They show something more modest and operationally sufficient: groups containing correct trajectories tend to maintain higher group confidence than all-incorrect groups across loops and model settings. That is enough to support a routing policy.

This is where Squeeze Evolve becomes a system rather than a metaphor.

The algorithm is a three-tier labor market for reasoning

Squeeze Evolve organizes recombination into three tiers:

| Candidate-group condition | Routing choice | Why it is economically sensible |
|---|---|---|
| Strong consensus or already easy | Lightweight non-LLM aggregation | Do not pay for synthesis when agreement is already present |
| High confidence / low diversity but still needs recombination | Cheaper model | The group likely contains enough signal for low-cost aggregation |
| Low confidence / high diversity | Stronger model | Uncertainty is where premium reasoning has higher marginal value |
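The three tiers translate into a small routing function. This is a sketch under stated assumptions: the `Tier` names, the `consensus_ratio` rule, and the fixed `confidence_threshold` are hypothetical choices for illustration, and a real deployment would calibrate thresholds per task rather than hard-code them.

```python
from collections import Counter
from enum import Enum
from typing import Sequence


class Tier(Enum):
    NO_LLM = "lightweight non-LLM aggregation"
    CHEAP = "cheaper model"
    STRONG = "stronger model"


def route_group(
    answers: Sequence[str],
    confidence: float,
    consensus_ratio: float = 1.0,        # assumed: full agreement skips the LLM
    confidence_threshold: float = -0.5,  # assumed: mean log-prob cut-off
) -> Tier:
    """Pick a tier for one candidate group from its answers and confidence."""
    top_count = Counter(answers).most_common(1)[0][1]
    if top_count / len(answers) >= consensus_ratio:
        return Tier.NO_LLM   # agreement is already present
    if confidence >= confidence_threshold:
        return Tier.CHEAP    # enough signal for low-cost aggregation
    return Tier.STRONG       # uncertain or conflicting group


unanimous = route_group(["7", "7", "7", "7"], confidence=-0.1)
mostly_agree = route_group(["7", "7", "7", "9"], confidence=-0.2)
conflicted = route_group(["7", "9", "3", "7"], confidence=-1.4)
```

The ordering of the checks matters: consensus is tested first so that the cheapest path wins whenever it applies, and the strong model is the fallback rather than the default.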

In the main math, coding, and vision experiments, the system uses group confidence as the routing signal. In ARC-AGI-V2, where the Gemini API does not expose log-probabilities, it uses answer diversity instead. In circle packing, it uses group confidence with fitness-weighted selection and an accumulate rule that preserves candidates across generations.

That variation matters. Squeeze Evolve is not a single brittle recipe; it is a framework with different operator settings depending on the task. The table of operator instantiations in the paper is an implementation detail, but it carries an important business message: routing systems need task-specific instrumentation. The routing signal for grid puzzles may not be the routing signal for code generation, and the update rule for scientific discovery may not be the update rule for multiple-choice QA.

A simple diagram captures the logic:

Strong model initializes candidates
Candidate groups are formed
Confidence or diversity estimates group difficulty
Easy consensus  → no LLM aggregation
Easy synthesis  → cheap model
Hard synthesis  → strong model
New candidate population
Repeat

The model is not thinking alone. The system is allocating intelligence.

The reasoning and coding results show a cost-capability shift, not universal dominance

The paper evaluates Squeeze Evolve across math, coding, scientific QA, multimodal vision, ARC-AGI-V2, and circle packing. The cleanest reading is not “Squeeze Evolve wins everything.” It does not. The better reading is that it often preserves or improves accuracy while moving the cost curve left.

Representative results from the main reasoning and coding table show the pattern:

| Benchmark | RSA baseline with strong model | Squeeze Evolve model pair | Squeeze Evolve accuracy | Squeeze Evolve cost (savings) |
|---|---|---|---|---|
| AIME 2025 | Qwen3-30B-A3B-Thinking: 89.2%, $0.94/problem | Qwen3-30B-A3B-Instruct + Thinking | 90.7% | $0.66/problem, 1.4× savings |
| HMMT 2025 | GPT-OSS-120B: 89.7%, $0.41/problem | GPT-OSS-20B + GPT-OSS-120B | 92.0% | $0.25/problem, 1.6× savings |
| GPQA-Diamond | Qwen3-30B-A3B-Thinking: 74.0%, $0.57/problem | Qwen3-30B-A3B-Instruct + Thinking | 75.9% | $0.32/problem, 1.8× savings |
| LiveCodeBench V6 | GPT-OSS-120B: 75.9%, $0.44/problem | GPT-OSS-20B + GPT-OSS-120B | 75.6% | $0.22/problem, 2.0× savings |

For open-source homogeneous pairs, the authors report that Squeeze Evolve matches or exceeds the strong-model-only (Model 2) baseline while costing roughly 1.4–2.1× less across tested settings. For heterogeneous open-source plus closed-source pairs, savings can reach 3.3×, though aggressive routing may reduce accuracy by several points. That trade-off is not a failure; it is the point. A routing system creates an operating curve. Product teams then choose the point on the curve that matches service-level requirements, margin targets, and user tolerance for errors.
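The savings figures and the idea of an operating curve both reduce to simple arithmetic. The first half of the sketch below recomputes the savings ratios from the per-problem costs quoted above. The second half is a toy mixture model of routed cost; its per-call prices and routing fractions are invented, and it is an assumption about how such a curve could be reasoned about, not the paper's cost accounting.

```python
def savings_ratio(baseline_cost: float, routed_cost: float) -> float:
    """How many times cheaper the routed system is than the baseline."""
    return baseline_cost / routed_cost


# (baseline $/problem, Squeeze Evolve $/problem) as quoted in the table above.
reported = {
    "AIME 2025": (0.94, 0.66),
    "HMMT 2025": (0.41, 0.25),
    "GPQA-Diamond": (0.57, 0.32),
    "LiveCodeBench V6": (0.44, 0.22),
}
ratios = {name: round(savings_ratio(b, r), 1) for name, (b, r) in reported.items()}


def expected_cost_per_group(
    frac_skip: float,
    frac_cheap: float,
    cheap_cost: float,
    strong_cost: float,
    scoring_cost: float = 0.0,
) -> float:
    """Toy model: skipped groups cost nothing beyond scoring, and the rest
    split between the cheap and strong models."""
    frac_strong = 1.0 - frac_skip - frac_cheap
    if frac_strong < 0:
        raise ValueError("routing fractions must sum to at most 1")
    return frac_cheap * cheap_cost + frac_strong * strong_cost + scoring_cost


# Hypothetical prices: cheap call $0.10, strong call $1.00 per group.
conservative = expected_cost_per_group(0.0, 0.3, 0.10, 1.00)
aggressive = expected_cost_per_group(0.2, 0.5, 0.10, 1.00)
```

Moving from the conservative to the aggressive setting in this toy model cuts the per-group cost from about $0.73 to about $0.35. Whether accuracy survives that move is the empirical question the benchmark tables answer, and it has to be measured per task.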

The most important detail is that no single model pair dominates all tasks. Qwen configurations lead in some reasoning settings; GPT-OSS configurations look stronger in others. This keeps the paper from becoming a model leaderboard. The contribution is the orchestration logic, not a claim that one model family is secretly magical. We have enough magical thinking in AI already.

Vision looks like an initialization problem more than a continuous perception problem

The multimodal results are one of the more interesting parts of the paper because they expose a useful decomposition.

For MMMU-Pro and BabyVision, the expensive vision-capable model is used to initialize the candidate population. Later recombination can be handled by cheaper models, including a text-only model in the heterogeneous setup. The paper reports that on MMMU-Pro, Squeeze Evolve with Qwen3.5-35B-A3B as the cheaper text-only Model 1 and Kimi-2.5-Thinking as Model 2 reaches 79.06% at $0.46/problem versus the Kimi-2.5-Thinking RSA baseline at 78.58% and $1.04/problem, a 2.3× saving. On BabyVision, the analogous setup gets 41.27% at $0.83/problem versus 43.23% at $2.05/problem, a 2.5× saving with a noticeable accuracy decline.

That difference matters. The MMMU-Pro result supports a strong story: once visual information has been grounded into candidate reasoning traces, later aggregation may not need image processing. The BabyVision result is more mixed: the cost saving is large, but the accuracy drop is real. So the proper interpretation is not “vision models are unnecessary after the first step.” It is more precise:

In some multimodal workflows, expensive visual grounding may be most valuable during initialization, while later reasoning over candidate traces can be cheaper. The strength of this decomposition depends on how much visual detail must be revisited during later reasoning.

For business systems, this is directly relevant. Invoice extraction, visual inspection, medical imaging support, property assessment, and UI-agent workflows may not require the most expensive multimodal model at every step. But the decomposition must be validated per task. If later reasoning requires returning to subtle visual evidence, text-only recombination may erase information. The paper’s evidence is promising; it is not a license to remove perception from every multimodal pipeline and call it architecture.

ARC-AGI-V2 shows routing can work even without log-probabilities

ARC-AGI-V2 is useful because the paper cannot use the same confidence signal there. Gemini does not expose log-probabilities through the relevant API, so Squeeze Evolve uses answer diversity instead.

This experiment is best read as a robustness and extension test of the routing principle. The question is whether the framework still works when the preferred confidence signal is unavailable. The answer is yes, at least on this public evaluation setup.

The reported ARC-AGI-V2 results are strong:

| Method | Accuracy | Cost per task | Notes |
|---|---|---|---|
| RSA with Gemini 3.1 Pro | 93.3% | $28.85 | Strong-model-only evolutionary baseline |
| Squeeze Evolve with Gemini 3.1 Pro and diversity routing | 97.5% | $7.74 | No code execution |
| Squeeze Evolve with Gemini 3.0 Flash + Gemini 3.1 Pro + lightweight aggregation | 97.5% | $5.93 | Three-way routing |
| Imbue + Gemini 3.1 Pro | 95.1% | $8.71 | Code execution / program synthesis comparison |
| Confluence Lab | 97.9% | $11.77 | Code execution / program synthesis comparison |

Two points should not be blurred.

First, compared with RSA using Gemini 3.1 Pro, Squeeze Evolve substantially reduces cost and improves accuracy. That supports the paper’s core claim that model assignment and lightweight aggregation can move the cost-capability frontier.

Second, compared with code-execution systems, Squeeze Evolve is competitive in cost and accuracy on the public evaluation set, but the comparison is not the same experimental condition. Code-execution systems use a different kind of external tool feedback. Squeeze Evolve’s result is impressive precisely because it does not rely on code execution in that setup, but it should not be inflated into a universal statement that verifier-free methods now dominate verifier-based methods.

Good interpretation is cheaper than hype. It also ages better.

Circle packing tests whether confidence can guide discovery, not just answer selection

The circle-packing experiment moves beyond ordinary question answering. The task is to pack 26 non-overlapping circles in a unit square to maximize the sum of their radii. This is closer to open-ended optimization and scientific discovery than to multiple-choice reasoning.
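The objective is easy to state in code, which is part of why the task is a popular discovery benchmark. The sketch below checks feasibility and computes the score for a candidate packing. The function names and the tolerance are assumptions, and the example uses four circles rather than 26 to keep it readable.

```python
from itertools import combinations
from math import hypot
from typing import List, Tuple

Circle = Tuple[float, float, float]  # (center x, center y, radius)


def is_valid_packing(circles: List[Circle], tol: float = 1e-9) -> bool:
    """True if every circle lies inside the unit square and no two overlap."""
    for x, y, r in circles:
        if r < 0:
            return False
        if x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return False
    for (x1, y1, r1), (x2, y2, r2) in combinations(circles, 2):
        # Tangent circles are allowed; overlapping interiors are not.
        if hypot(x1 - x2, y1 - y2) < r1 + r2 - tol:
            return False
    return True


def score(circles: List[Circle]) -> float:
    """Objective to maximize: the sum of the radii."""
    return sum(r for _, _, r in circles)


# Four equal circles on a 2 x 2 grid: feasible, with a score of 1.0.
grid = [(0.25, 0.25, 0.25), (0.75, 0.25, 0.25),
        (0.25, 0.75, 0.25), (0.75, 0.75, 0.25)]
overlapping = [(0.5, 0.5, 0.3), (0.6, 0.5, 0.3)]
```

A checker like this is cheap to run, which is what makes the paper's choice not to feed its output back into the evolutionary loop a deliberate experimental condition rather than a necessity.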

Here Squeeze Evolve uses GPT-OSS-20B and GPT-OSS-120B, group confidence, fitness-weighted selection, percentile routing, and an accumulate rule that preserves candidates across generations. The paper compares the final score with other evolutionary systems:

| Method | Model setup | Score |
|---|---|---|
| ShinkaEvolve | Ensemble including Claude Sonnet-4, GPT-4.1 variants, and o4-mini | 2.635982 |
| Squeeze Evolve | GPT-OSS-120B + GPT-OSS-20B | 2.635896 |
| AlphaEvolve | Gemini-2.0 Pro + Flash | 2.635862 |
| OpenEvolve | Gemini-2.0 Flash + Claude 3.7 Sonnet | 2.634292 |

This is not a clean victory lap. ShinkaEvolve remains slightly higher in the reported table. But Squeeze Evolve is extremely close and slightly above AlphaEvolve in this setting, while operating without in-flight execution feedback or verifier output during evolution.

The appendix helps explain why this result is plausible. The evolved algorithm combines diverse initialization strategies, LP-based radius assignment for fixed centers, simulated annealing, SLSQP refinement, and feasibility post-processing. The relevant point for Squeeze Evolve is not that the final program contains no optimization machinery. It does. The point is that the evolutionary search process uses model-intrinsic confidence as the selection/routing signal rather than repeatedly executing candidates and feeding objective scores back into the loop.

That distinction matters for business interpretation. Confidence is a weak surrogate, not a replacement for real measurement when measurement is cheap. If you can cheaply execute code, run tests, simulate outcomes, or compare against known labels, you should probably use that information. Very radical, I know. But when external verification is expensive, delayed, or unavailable, model-intrinsic signals may still guide useful search.

The system result is not an appendix detail; it is the business case

Many AI papers treat systems engineering as plumbing. Squeeze Evolve cannot afford that luxury because the value proposition is economic. Routing only matters if the overhead does not eat the savings.

The paper reports two system findings.

First, routing overhead is small. The authors isolate overhead by comparing standard RSA with a Squeeze Evolve variant that enables confidence scoring and threshold computation but forces all aggregation calls to the strong model. This is a conservative test because normal Squeeze Evolve also routes work to the cheaper model. Across tested models, average routing overhead is reported at 2.4–4.3% end-to-end. The appendix shows benchmark-level variation, with higher relative overhead in short-generation settings such as GPQA, where fixed scoring costs loom larger.

Second, fixed-budget throughput can rise sharply when serving pools are sized properly. Under the same total GPU budget, the Qwen3-30B/235B pair reaches roughly 4–10× throughput speedups, while the GPT-OSS-20B/120B pair reaches about 1.4–3.4×. The difference is explained by model asymmetry: if the small model is much cheaper to serve relative to the large one, routing creates more throughput leverage.

This is where the paper becomes operationally serious. It does not merely say, “Use a cheaper model sometimes.” It argues that the serving system must be co-designed around routing:

| System component | Paper’s design choice | Business meaning |
|---|---|---|
| Confidence scoring | Prefill-only scoring rather than full decoding | Routing signal is cheaper than generating another answer |
| Custom confidence engine | Return a scalar instead of full token-level logprob tensors | Avoid turning metadata transfer into a bottleneck |
| Latency-matched pools | Allocate GPUs so cheap and expensive pools finish together | Prevent the faster pool from idling while the slower pool dominates latency |
| Requests/sec metric | Compare completed requests under fixed GPU budgets | Measure deployment throughput, not just theoretical token savings |
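The latency-matching row can be illustrated with a deliberately simple model. The sketch below assumes each pool's finish time is its token workload divided by a linear per-GPU throughput, and it searches integer GPU splits for the smallest makespan. The workloads and throughput figures are invented, and real serving adds batching, parallelism, and queueing effects that this ignores.

```python
from typing import Tuple


def pool_finish_time(work_tokens: float, gpus: int, tokens_per_gpu_sec: float) -> float:
    """Seconds to drain a pool under an assumed linear scaling model."""
    return work_tokens / (gpus * tokens_per_gpu_sec)


def latency_matched_split(
    total_gpus: int,
    cheap_work: float,
    strong_work: float,
    cheap_rate: float,
    strong_rate: float,
) -> Tuple[int, int, float]:
    """Return (cheap_gpus, strong_gpus, makespan) minimizing the slower pool."""
    if total_gpus < 2:
        raise ValueError("need at least one GPU per pool")
    best = None
    for cheap_gpus in range(1, total_gpus):
        strong_gpus = total_gpus - cheap_gpus
        makespan = max(
            pool_finish_time(cheap_work, cheap_gpus, cheap_rate),
            pool_finish_time(strong_work, strong_gpus, strong_rate),
        )
        if best is None or makespan < best[2]:
            best = (cheap_gpus, strong_gpus, makespan)
    return best


# Hypothetical load: the cheap pool handles more tokens but serves them 6x faster.
cheap_gpus, strong_gpus, makespan = latency_matched_split(
    total_gpus=8, cheap_work=6e6, strong_work=4e6, cheap_rate=3000, strong_rate=500
)
```

In this invented setting the best split gives two GPUs to the cheap pool and six to the strong pool. An even four-four split would leave the cheap pool idle for most of the run while the strong pool sets the latency.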

The custom confidence engine is especially revealing. The paper reports that the native path can transfer about 13 MB of token-level logprob data per request, while the custom path returns around 100 bytes. It also enables confidence scoring at a scale where the native path runs out of memory. That is not an academic footnote. It is the difference between “nice routing idea” and “deployable routing system.”

For SaaS builders, this is the part to underline. Model orchestration is not a prompt-engineering trick. It is a serving architecture.

What the evidence supports, and what it does not

A clean way to read the paper is to separate the claims by evidentiary role.

| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Unified evolutionary formulation | Conceptual framework | Many test-time methods share operators: initialization, selection, recombination, fitness | That all tasks should use the same operator settings |
| Figure 2 diversity/pass@k analysis | Motivation evidence | Single-model verifier-free evolution can narrow search capacity | That multi-model routing always preserves diversity |
| Table 2 role assignment | Ablation-style evidence | Strong initialization can matter more than strong later aggregation | That initialization is always the only important stage |
| Figure 3 and Appendices F/G | Routing-signal validation | Candidate correctness and confidence/diversity correlate enough to guide routing | That confidence is a reliable verifier |
| Reasoning/coding cost tables | Main benchmark evidence | Squeeze Evolve often preserves accuracy while reducing API cost | That aggressive routing has no quality trade-off |
| Vision results | Extension to multimodal tasks | Visual grounding may be concentrated in initialization for some tasks | That text-only later stages are always safe |
| ARC-AGI-V2 | Robustness to missing logprobs | Diversity can substitute for confidence in a routing rule | That ARC results automatically generalize to private or harder sets |
| Circle packing | Discovery-style extension | Confidence can sometimes guide open-ended search without verifier feedback | That verifier-free search replaces execution-based discovery when execution is cheap |
| System throughput tests | Deployment evidence | Routing can improve fixed-budget serving throughput when pools are sized correctly | That savings appear without engineering work |

This table is less exciting than a one-line “new state of the art” claim. It is also more useful.

The business lesson is model governance at the step level

The immediate business implication is not that every company should copy Squeeze Evolve exactly. Most companies do not run ARC-AGI solvers or circle-packing discovery loops in production, though some product roadmaps occasionally sound like they were generated by one.

The implication is that AI workflow design should move from model-level governance to step-level governance.

Model-level governance asks:

  • Which model do we use?
  • What is the average cost per request?
  • What is the observed task accuracy?

Step-level governance asks:

  • Which steps need premium capability?
  • Which steps only need cheap synthesis?
  • Which steps can use deterministic or lightweight aggregation?
  • What signal determines escalation?
  • What is the cost of measuring that signal?
  • How does routing affect latency, throughput, and reliability?

That is a more mature set of questions. It also matches how real operations work. A bank does not assign its most senior analyst to every spreadsheet cell. A logistics company does not use the same vehicle for every package. A hospital does not send every patient directly to the most specialized surgeon. But AI systems often route everything to the biggest model because it is simpler and looks safer. Squeeze Evolve shows why that simplicity can be expensive and sometimes counterproductive.

For Cognaptus-style automation systems, the practical pathway is clear:

| Workflow stage | Premium model likely useful? | Cheaper model likely useful? | Non-LLM logic likely useful? |
|---|---|---|---|
| Initial extraction from ambiguous documents | High | Medium | Low to medium |
| Candidate plan generation | High | Medium | Low |
| Aggregating similar candidate answers | Medium | High | Medium |
| Handling consensus answers | Low | Low | High |
| Escalating conflicting outputs | High | Low to medium | Medium |
| Formatting final structured output | Low | High | High |
| Monitoring confidence and disagreement | Medium | Medium | High |

This is not a recommendation to always downgrade. It is a recommendation to instrument the pipeline so downgrading, escalating, and skipping are explicit decisions rather than vibes.
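One way to make those decisions explicit is to record each one with the signal that triggered it. The sketch below is a hypothetical logging structure, not something the paper prescribes: the `RoutingDecision` fields, the action labels, and the sample values are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List

ACTIONS = ("skip", "downgrade", "escalate")


@dataclass(frozen=True)
class RoutingDecision:
    step: str         # workflow stage, e.g. "aggregate answers"
    action: str       # one of ACTIONS
    signal: str       # which proxy drove the decision
    value: float      # observed signal value
    threshold: float  # threshold in force when the decision was made


def summarize(decisions: List[RoutingDecision]) -> Dict[str, float]:
    """Share of decisions per action, so routing drift is visible over time."""
    if not decisions:
        return {action: 0.0 for action in ACTIONS}
    counts = Counter(d.action for d in decisions)
    return {action: counts[action] / len(decisions) for action in ACTIONS}


log = [
    RoutingDecision("aggregate answers", "skip", "answer_diversity", 1, 1),
    RoutingDecision("aggregate answers", "downgrade", "group_confidence", -0.2, -0.5),
    RoutingDecision("aggregate answers", "downgrade", "group_confidence", -0.4, -0.5),
    RoutingDecision("aggregate answers", "escalate", "group_confidence", -1.3, -0.5),
]
shares = summarize(log)
```

A summary like this gives a team something to review: if the escalation share creeps up after a prompt or model change, the routing thresholds or the upstream candidate quality deserve a second look.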

The boundary: routing needs signals, structure, and enough scale to matter

The paper’s results are strongest when the workflow has three properties.

First, there must be a meaningful candidate population. Squeeze Evolve is not designed for a one-shot answer where there are no candidates to compare. It becomes useful when the system samples, aggregates, revises, or searches.

Second, there must be a usable proxy for group difficulty. Group confidence works when token log-probabilities or prefill scoring are available. Diversity works when answers can be extracted and compared. If neither signal is reliable, routing becomes decorative architecture. Decorative architecture is popular, but the margins are poor.

Third, the cost difference between model tiers must be large enough to overcome orchestration complexity. The Qwen throughput gains are larger than the GPT-OSS gains partly because the effective serving asymmetry is larger. If the cheap and expensive models are not very different in cost or latency, routing may still help quality or diversity, but the economic case weakens.

There are also task boundaries. If correctness can be cheaply verified, use verification. If the task requires continuous access to source evidence, avoid routing later stages to models that cannot inspect that evidence. If a workflow is safety-critical, confidence should support escalation, not replace validation. And if the system is small-volume, the engineering cost of multi-model serving may exceed the inference savings. Not every problem needs a distributed intelligence allocator. Some need a good prompt and a cup of coffee.

The real shift is from “more reasoning” to “better allocation”

Squeeze Evolve is valuable because it reframes test-time scaling as an allocation problem.

The paper does not say that strong models are unnecessary. It says they are scarce resources inside a pipeline. Use them where they shape the search space. Use them where uncertainty is high. Do not waste them where consensus already exists or where a cheaper model can aggregate strong candidates.

That is a more durable lesson than any individual benchmark number. Model prices will change. API features will change. GPT-5 mini, Gemini 3.1 Pro, Kimi, Qwen, and GPT-OSS will all eventually look like artifacts from a particular historical layer of the AI stack. The allocation logic will remain useful because it addresses a structural problem: inference systems are becoming multi-step, multi-model, and economically constrained.

The old question was: How smart is the model?

The better question is: Where should intelligence be spent?

Squeeze Evolve gives one answer: start strong, route by uncertainty, aggregate cheaply when possible, and skip the model when agreement is already enough.

It is not the end of test-time scaling. It is a sign that test-time scaling is growing up. The system is no longer just thinking harder. It is learning when thinking harder is worth paying for.

**Cognaptus: Automate the Present, Incubate the Future**


  1. Monishwaran Maheswaran et al., “Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution,” arXiv:2604.07725v2, April 10, 2026, https://arxiv.org/abs/2604.07725