Scheduling a factory, routing a fleet, pricing airline seats, allocating scarce capacity: these are not “write me a Python script” problems with nicer stationery. In real operations research, the useful answer is not merely a correct mathematical model. It is a method that stays feasible, keeps solution quality high, and finishes before the business context has expired.
That is the useful irritation behind FrontierOR, a new benchmark for testing whether LLMs can design efficient algorithms for large-scale optimisation problems rather than merely translate business prose into solver code.[1] Its main finding is not that current models are useless. That would be too easy, and also false. The sharper result is that the strongest models can often produce runnable programs, yet still fail the actual operational test: valid, high-quality, fast solutions on large instances.
The distinction matters because many enterprise AI demos quietly stop at the point where the code runs. FrontierOR keeps going. Annoying, yes. Necessary, also yes.
The main evidence: execution is no longer the interesting bottleneck
The paper’s central evidence is Table 2, which reports one-shot performance across 180 FrontierOR tasks and a 50-task Hard subset. Far from a decorative leaderboard, it measures four stages of usefulness: whether the generated program executes, whether its output is feasible, whether its solution is within 1% of the Gurobi reference, and whether it reaches that quality no slower than Gurobi. The last metric is quality–time efficiency, or QTE.
A few numbers do most of the intellectual work.
| Model | Full execution | Full feasibility | Full solution quality | Full QTE | Hard feasibility | Hard solution quality | Hard QTE |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 0.93 | 0.62 | 0.48 | 0.31 | 0.60 | 0.44 | 0.32 |
| GPT-5.3-Codex | 0.98 | 0.60 | 0.48 | 0.26 | 0.49 | 0.30 | 0.18 |
| Gemini 3.1 Pro | 0.93 | 0.61 | 0.52 | 0.25 | 0.64 | 0.44 | 0.22 |
| DeepSeek-R1 | 0.74 | 0.42 | 0.31 | 0.17 | 0.37 | 0.20 | 0.11 |
| Qwen3-Coder-Plus | 0.60 | 0.26 | 0.20 | 0.09 | 0.21 | 0.12 | 0.07 |
| LLaMA-4-Maverick | 0.47 | 0.18 | 0.13 | 0.06 | 0.13 | 0.07 | 0.02 |
The obvious headline would be that frontier models beat cheaper or open-source alternatives. True, but not the point. The more useful headline is that GPT-5.3-Codex reaches a 0.98 execution rate on both Full and Hard, yet falls to 0.18 QTE on Hard. The program runs. The business still may not have an answer.
Claude Opus 4.6 posts the highest QTE: 0.31 on Full and 0.32 on Hard. Gemini 3.1 Pro has the best Hard feasibility at 0.64 and ties Claude on Hard solution quality at 0.44. These are respectable scores for a hard benchmark. They are not a basis for handing an AI agent the weekly production plan and taking an early lunch.
QTE is deliberately unforgiving. A generated program must return a feasible solution, stay within 1% of the Gurobi reference, and be at least as fast as Gurobi. That triple condition is exactly why the metric is useful. Businesses do not purchase “almost feasible” schedules, nor do they celebrate routes that are optimal sometime next quarter.
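The triple condition can be sketched as a small scoring function. This is an illustration, not the paper's evaluator: the `RunResult` fields, the minimisation convention, and the relative-gap formula are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    """Hypothetical record of one generated program on one instance."""
    executed: bool                   # program ran without crashing
    feasible: bool                   # an independent checker accepted the output
    objective: Optional[float]       # objective recomputed by the checker
    time_to_target: Optional[float]  # seconds until the quality target was reached


def staged_scores(run: RunResult, ref_objective: float, ref_time: float,
                  tol: float = 0.01) -> dict:
    """Return pass/fail flags for the four stages of a minimisation task."""
    executed = run.executed
    feasible = executed and run.feasible
    quality = False
    if feasible and run.objective is not None:
        # Relative gap against the reference objective (assumed definition).
        gap = (run.objective - ref_objective) / max(abs(ref_objective), 1e-9)
        quality = gap <= tol
    # QTE needs all three: feasible, within tolerance, and no slower than the reference.
    qte = (quality and run.time_to_target is not None
           and run.time_to_target <= ref_time)
    return {"execution": executed, "feasibility": feasible,
            "quality": quality, "qte": qte}


# A run that is feasible and within 1% of the reference, but slower than it.
slow_run = RunResult(executed=True, feasible=True, objective=100.5, time_to_target=900.0)
print(staged_scores(slow_run, ref_objective=100.0, ref_time=600.0))
```

In this sketch the slow run clears execution, feasibility, and quality, then fails QTE on time alone, which is the pattern the Table 2 numbers describe.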
FrontierOR is an algorithm-design exam, not a modelling quiz
The benchmark is designed to avoid the comfortable shortcut where an LLM turns prose into a mixed-integer program, calls a commercial solver, and claims adulthood.
FrontierOR contains 180 tasks derived from operations research papers published between 1992 and 2025, with 69 papers from 2020 or later. The tasks span multiple optimisation paradigms, problem classes, and application domains including transportation, energy, supply chain, and healthcare. Even the smallest of the large instances have a median of roughly 40,000 decision variables and 18,000 constraints; larger sizes scale further. Gurobi fails to reach optimality within one hour on 46% of large instances. That detail is important: the baseline itself is not a toy.
The LLM-facing task description is intentionally operational. The model receives natural-language prose plus input and output schemas. It does not receive the mathematical formulation, reference solver implementation, or algorithmic hints. The hidden evaluator contains the formulation, expert-verified Gurobi implementation, reference solutions, and a standalone feasibility checker.
This design choice is doing real methodological work. It tests whether the model can infer latent structure from business language: decomposability, network coupling, sparsity, assignment structure, time windows, capacity constraints, scenario structure. In short, the benchmark asks whether the model can see the optimisation problem inside the operational story.
That is closer to how enterprise optimisation work actually arrives. Nobody from procurement walks into the room and says, “Please exploit the block-angular structure of this stochastic mixed-integer program.” They say deliveries are late, warehouses are full, labour rules are annoying, and finance wants the answer by Friday. Then the suffering begins.
The benchmark’s plumbing is part of the contribution
The dataset construction is not just scaffolding around the experiment. It is one of the paper’s main contributions.
Each task is built from a source paper through formulation extraction, instance specification, solver implementation, natural-language task generation, and feasibility-checker construction. The authors use automated extraction and implementation support, but they do not leave the benchmark to vibes and hope. Fifteen OR experts audit the mathematical formulations, problem descriptions, Gurobi code, and feasibility checkers over a three-week multi-turn review process.
The feasibility checkers are especially important. They verify hard constraints, variable domains, and bounds. They also recompute the objective from the submitted solution, which prevents the classic “trust me, my objective is excellent” manoeuvre. A generated algorithm that reports a gorgeous objective attached to an invalid solution receives no objective credit. This is rude in the most productive way.
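A checker of this kind is easy to picture on a toy problem. The sketch below assumes a hypothetical job-to-machine assignment task with capacity limits; the function name, the data layout, and the `claimed_objective` field are illustrative and not taken from the benchmark.

```python
def check_assignment(assignment, cost, demand, capacity):
    """Validate a job -> machine assignment and recompute its cost.

    Returns (is_feasible, objective). The objective is None when the
    solution is infeasible, so a reported objective earns nothing by itself.
    """
    n_jobs, n_machines = len(demand), len(capacity)
    # Domain check: exactly one valid machine index per job.
    if len(assignment) != n_jobs:
        return False, None
    if any(not isinstance(m, int) or not 0 <= m < n_machines for m in assignment):
        return False, None
    # Hard constraint: total demand on a machine must not exceed its capacity.
    load = [0.0] * n_machines
    for job, machine in enumerate(assignment):
        load[machine] += demand[job]
    if any(load[m] > capacity[m] + 1e-9 for m in range(n_machines)):
        return False, None
    # Recompute the objective from the solution rather than trusting the submission.
    objective = sum(cost[job][machine] for job, machine in enumerate(assignment))
    return True, objective


cost = [[4, 2], [3, 5], [1, 6]]   # cost[job][machine]
demand = [2, 2, 2]
capacity = [4, 2]

# The submission claims an objective of 0; the checker ignores the claim.
submission = {"assignment": [1, 0, 0], "claimed_objective": 0.0}
print(check_assignment(submission["assignment"], cost, demand, capacity))

# Overloading machine 0 is rejected outright, with no objective credit.
print(check_assignment([0, 0, 0], cost, demand, capacity))
```

The design point is that the checker never reads `claimed_objective`: credit depends only on what can be recomputed from the solution.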
These details sit in the paper’s appendix as implementation notes, but they matter for interpretation. FrontierOR is not only evaluating LLMs; it is also proposing what a serious evaluation harness for optimisation agents should look like. Hidden references, independent feasibility checking, tiny-instance gates, large-instance scoring, and runtime comparison are not academic fussiness. They are what separates an optimisation system from an expensive autocomplete session.
The metric punishes the right failure
QTE can look harsh until one remembers what optimisation is for. A feasible but poor solution is operationally weak. A high-quality solution that arrives too late is also weak. A fast infeasible solution is not a solution; it is a confidently formatted liability.
| Evaluation layer | What it catches | Why it matters in business |
|---|---|---|
| Execution rate | Runtime errors and broken code | Basic automation reliability |
| Feasibility | Violated constraints, invalid assignments, impossible plans | Prevents operational nonsense from entering workflows |
| Solution quality | Objective gap against the reference | Measures whether the answer is economically competitive |
| QTE | Quality and speed together | Captures whether the method is useful under decision deadlines |
This is why the misconception around LLM optimisation needs correction. The hard part is not simply “natural language to solver-ready formulation.” That is a meaningful capability, but it is not the end of the job. FrontierOR shows that the harder bottleneck is scalable algorithm design: choosing when to decompose, when to relax, when to use heuristics, when to warm-start, when to call a solver, and when not to.
The “when not to” is underappreciated. A direct solver call can be perfectly rational on a small instance and disastrous on a large one. Scale changes the nature of correctness. The same model formulation can be mathematically valid and operationally useless if the algorithmic wrapper is naive.
The leaderboard is partly a coverage story
Appendix D is best read as a robustness and sensitivity analysis for the headline metrics. The binary Table 2 metrics answer: how often does the model clear a strict threshold? The appendix asks a subtler question: when a model returns feasible solutions, how far away is it from Gurobi in quality and time?
This matters because thresholded metrics hide magnitude. A solution 1.1% away from the reference and a solution 40% away both fail the same binary 1% quality test. The continuous analysis shows that frontier models are often close to Gurobi in quality on their feasible cases. On the Full set, the three frontier models have continuous solution-quality differences near zero: Claude at 0.001, GPT-5.3-Codex at 0.014, and Gemini at 0.021. The paper interprets this as average objective quality within a few percent of the Gurobi reference.
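The difference between the binary and continuous views fits in a few lines. The relative-gap formula below is an assumed definition for a minimisation objective, and the objective values are hypothetical, chosen to reproduce the 1.1% and 40% cases mentioned above.

```python
def relative_gap(objective: float, reference: float) -> float:
    """Relative gap for a minimisation objective (assumed definition)."""
    return (objective - reference) / abs(reference)


def binary_and_continuous(objectives: dict, reference: float, tol: float = 0.01) -> dict:
    """Map each run name to (relative gap, passes the threshold)."""
    report = {}
    for name, objective in objectives.items():
        gap = relative_gap(objective, reference)
        report[name] = (gap, gap <= tol)
    return report


# Hypothetical objectives against a reference of 1000.
runs = {"near_miss": 1011.0, "far_miss": 1400.0}
for name, (gap, passed) in binary_and_continuous(runs, reference=1000.0).items():
    print(f"{name}: gap={gap:.1%} passes_threshold={passed}")
```

Both runs fail the threshold, yet the continuous gap keeps the difference between a near miss and a poor solution visible.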
Runtime tells a different story. The continuous time-efficiency values for leading models correspond to roughly 3×–5× speedups in reaching the same 1% quality target. So the result is not “LLMs always produce worse optimisation code.” It is more interesting: when they solve the right cases, they can be fast and competitive. The problem is coverage. They do not yet solve enough of the task distribution reliably.
The pairwise shared-task analysis reinforces this. Conditional on both models solving the same task, solution-quality differences are small for almost every model pair. Runtime separates the models more sharply, and frontier models generally reach shared quality targets faster than weaker models. DeepSeek-R1 even matches or slightly outperforms GPT-5.3-Codex and Claude on shared tasks in some quality and speed comparisons. But its aggregate score remains lower because it solves fewer tasks.
For business readers, this distinction is not cosmetic. If an AI optimiser works beautifully on 30% of your planning cases and fails unpredictably on the rest, the correct procurement question is not “How good is it when it works?” The correct question is “Can we detect the cases where it will work before we trust it?” Coverage is not a footnote. It is the deployment problem.
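A minimal sketch of that distinction, assuming each task yields either a relative gap or `None` when no feasible solution was returned. The function name and the gap values are hypothetical; the 30% coverage mirrors the example in the paragraph above.

```python
from typing import Optional, Sequence


def coverage_report(gaps: Sequence[Optional[float]], tol: float = 0.01) -> dict:
    """Separate 'how often it works' from 'how good it is when it works'."""
    solved = [g for g in gaps if g is not None]
    return {
        # Share of tasks with any feasible solution.
        "coverage": len(solved) / len(gaps),
        # Quality conditional on having a feasible solution.
        "mean_gap_when_solved": sum(solved) / len(solved) if solved else None,
        # Share of ALL tasks within tolerance; unsolved tasks count as failures.
        "pass_rate_all_tasks": sum(g <= tol for g in solved) / len(gaps),
    }


# Hypothetical agent: excellent on 3 of 10 planning cases, nothing on the rest.
gaps = [0.0, 0.002, 0.004] + [None] * 7
print(coverage_report(gaps))
```

The conditional gap looks superb while the pass rate over all tasks stays at 30%. Reporting only the first number is how a weak deployment case gets dressed up as a strong one.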
The mechanism: solver calls are the baseline, not the strategy
Figure 3 is diagnostic evidence. Its purpose is to explain how model behaviour differs, not merely to decorate the score table.
The algorithm-family analysis classifies generated code into monolithic solver calls, decomposition, constructive heuristics, local search or metaheuristics, and matheuristic or hybrid methods. The pattern is blunt. LLaMA-4-Maverick relies on monolithic solver calls in 99% of tasks. Qwen3-Coder-Plus does so in 72%. Claude Opus 4.6 is more balanced: 37% monolithic solver calls, 27% local search or metaheuristics, and 27% matheuristic or hybrid methods.
That balance helps explain why Claude leads on QTE. It is not simply “better at coding.” It is more willing to choose algorithmic templates that trade exactness, speed, and structure more intelligently. The stronger models more often produce decomposition, local-search, and hybrid methods. The weaker models reach for the solver as a universal hammer. To be fair, the hammer is excellent. It is just not a strategy.
The failure-mode analysis makes the same point from another angle. For most models, formulation design flaws dominate. Weaker models also show more constraint-specification errors and interface or schema violations. Stronger models shift the bottleneck later: fewer formulation mistakes, more failures in heuristic search and refinement. That is progress, but it is not completion. The model graduates from misunderstanding the problem to choosing an insufficient method. A promotion, technically.
The case studies make this concrete. In an airline choice-based network revenue-management task, both GPT-5.3-Codex and Claude build a column-generation structure. The difference is in pricing. GPT uses a heuristic local search and exits too early, leaving the active set too thin; about 40% of large instances violate leg capacity. Claude uses exact pricing with enumeration and terminates only with an optimality certificate, matching Gurobi with near-zero average gap. Same broad formulation, different algorithmic discipline.
In a sum-of-squared-loads scheduling task, GPT builds and solves a mixed-integer quadratic program unconditionally. On large instances, the solver stalls and returns a poor incumbent, with an average relative gap of roughly 310%. Gemini first builds a near-optimal heuristic warm start, calls the exact solver only behind size and time guards, adds symmetry breaking, and returns the better of the heuristic and solver result. It matches reference quality and finishes about 534 seconds faster.
The lesson is not “never use solvers.” That would be a charmingly bad takeaway. The lesson is to use solvers as components inside scale-aware algorithms: guarded, warm-started, decomposed, bounded, and checked.
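The guarded pattern can be sketched on a toy version of the sum-of-squared-loads problem. This is not the code from the case study: brute-force enumeration stands in for an exact solver so the example runs without dependencies, and the function names, the `max_exact_jobs` size guard, and the `time_limit` value are all hypothetical.

```python
import itertools
import time


def sum_sq_loads(assignment, durations, n_machines):
    """Objective: sum of squared machine loads."""
    loads = [0.0] * n_machines
    for job, machine in enumerate(assignment):
        loads[machine] += durations[job]
    return sum(load * load for load in loads)


def lpt_heuristic(durations, n_machines):
    """Greedy warm start: longest job first, onto the least-loaded machine."""
    loads = [0.0] * n_machines
    assignment = [0] * len(durations)
    for job in sorted(range(len(durations)), key=lambda j: -durations[j]):
        machine = min(range(n_machines), key=loads.__getitem__)
        assignment[job] = machine
        loads[machine] += durations[job]
    return assignment


def exact_enumeration(durations, n_machines, incumbent, deadline):
    """Stand-in for an exact solver: brute force, seeded with an incumbent."""
    best, best_val = incumbent, sum_sq_loads(incumbent, durations, n_machines)
    for cand in itertools.product(range(n_machines), repeat=len(durations)):
        if time.monotonic() > deadline:
            break  # time guard: stop and keep the best solution found so far
        val = sum_sq_loads(cand, durations, n_machines)
        if val < best_val:
            best, best_val = list(cand), val
    return best


def solve(durations, n_machines, max_exact_jobs=10, time_limit=1.0):
    """Heuristic first; exact search only behind size and time guards."""
    warm = lpt_heuristic(durations, n_machines)
    if len(durations) > max_exact_jobs:  # size guard: skip the exact step entirely
        return warm
    deadline = time.monotonic() + time_limit
    exact = exact_enumeration(durations, n_machines, warm, deadline)
    # Return whichever of the two candidates has the better objective.
    return min((warm, exact), key=lambda a: sum_sq_loads(a, durations, n_machines))


small = [3, 3, 2, 2, 2]
print(sum_sq_loads(lpt_heuristic(small, 2), small, 2))  # heuristic alone
print(sum_sq_loads(solve(small, 2), small, 2))          # heuristic plus guarded exact step
```

On the small instance the exact step improves on the greedy start. On an instance above the size guard, `solve` returns the heuristic result instead of stalling, which is the behaviour the unconditional solver call lacked.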
Self-evolution improves the floor, but it is not magic with a progress bar
The paper’s self-evolution experiments are an exploratory extension and a comparison among agentic search frameworks. They are not a universal proof that test-time evolution solves enterprise optimisation. The authors select the 40% most challenging tasks in the Hard subset based on weak one-shot performance, use GPT-5.3-Codex as the shared backbone, initialise each framework from the same seed program, and allow a 30-attempt budget.
The result is meaningful: all three frameworks beat the one-shot baseline on feasibility, solution quality, and QTE.
| Method | Execution | Feasibility | Solution quality | QTE |
|---|---|---|---|---|
| One-shot | 0.80 | 0.45 | 0.18 | 0.15 |
| EoH | 0.78 | 0.72 | 0.43 | 0.33 |
| OpenEvolve | 1.00 | 0.92 | 0.61 | 0.49 |
| CORAL | 1.00 | 1.00 | 0.67 | 0.50 |
CORAL performs best, reaching 1.00 feasibility, 0.67 solution quality, and 0.50 QTE. OpenEvolve is close behind at 0.49 QTE. EoH improves over one-shot but is less stable.
Figure 4 adds trajectory evidence. Across frameworks, speed advantage crosses the Gurobi baseline within roughly five attempts. Solution quality is harder. Only CORAL consistently crosses the Gurobi reference gap from around attempt 16 onward. That asymmetry is the important part: making code faster is often easier than making the algorithm good enough. Enterprise teams may recognise this pattern from every “quick optimisation improvement” that later turns into a month of constraint debugging. History has hobbies.
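The shape of such a trajectory comes from a simple loop: propose, evaluate, keep the best, stop at the budget. The sketch below is a generic best-so-far search and not an implementation of EoH, OpenEvolve, or CORAL. The real frameworks mutate programs with an LLM; here a hypothetical numeric parameter and a toy score stand in for the program and its evaluation.

```python
import random


def evolve(seed, propose, evaluate, budget=30):
    """Greedy best-so-far search under a fixed attempt budget."""
    best, best_score = seed, evaluate(seed)
    trajectory = [best_score]          # best score after 0, 1, ..., budget attempts
    for _ in range(budget):
        candidate = propose(best)      # new candidate derived from the current best
        candidate_score = evaluate(candidate)
        if candidate_score > best_score:   # keep only strict improvements
            best, best_score = candidate, candidate_score
        trajectory.append(best_score)
    return best, trajectory


# Toy stand-ins: the "program" is a number and the score peaks at 3.0.
rng = random.Random(0)
score = lambda x: -(x - 3.0) ** 2
best, trajectory = evolve(0.0, lambda x: x + rng.uniform(-1.0, 1.0), score, budget=30)
print(len(trajectory), trajectory[0], trajectory[-1])
```

The trajectory never decreases, but nothing guarantees it crosses a given reference within the budget. That is the gap between getting faster in five attempts and getting good enough by attempt 16.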
The appendix case studies explain three improvement pathways. OpenEvolve shows depth: small, in-family refinements accumulate around a column-generation skeleton. EoH shows breadth: it can jump from a heuristic to a heuristic–exact hybrid. CORAL shows migration: separate agents produce reusable algorithmic components, such as heuristics, pattern enumeration, and residual exact assignment, which are assembled through shared memory.
This is promising because hard optimisation often requires exactly that mixture: local refinement, occasional method-family changes, and recombination of partial techniques. But the boundary is clear. The experiments cover selected hard tasks, with a limited attempt budget and benchmark-controlled feedback. Production systems must still deal with private data messiness, changing constraints, unreliable inputs, and people who rename columns because it “looked cleaner.”
What businesses should actually infer
The paper directly shows that current LLMs are not reliable general-purpose optimisation algorithm designers under one-shot conditions. It also shows that test-time feedback and agentic search can materially improve performance on selected hard tasks. Cognaptus’ business interpretation is narrower and more useful: evaluate optimisation agents as optimisation systems, not as chatbots that happen to emit code.
| Layer | What the paper shows | Business interpretation | What remains uncertain |
|---|---|---|---|
| One-shot generation | Strong models often execute code but much less often meet feasibility, quality, and speed together | Do not accept runnable solver demos as evidence of deployment readiness | Performance may differ on proprietary task families and curated internal templates |
| Benchmark design | Hidden references and feasibility checkers expose invalid or slow solutions | Build internal evaluators before buying or deploying optimisation agents | Checker construction itself can be costly and domain-specific |
| Algorithm families | Hybrid, decomposition, and search methods often outperform monolithic solver calls | Prefer agents that can choose method families, not just call Gurobi from generated Python | Automatic classification of good algorithmic choices remains imperfect |
| Self-evolution | Feedback-driven search improves feasibility and QTE on selected hard tasks | Use development sets, held-out tests, and iterative candidate search for hard planning workflows | Test-time compute budgets and stability need production validation |
| Continuous analysis | Coverage explains much of the model gap | Measure where the agent works, not only how well it works when it works | Coverage estimation needs enough representative internal cases |
A serious enterprise deployment path therefore looks less like “connect LLM to solver” and more like an engineering pipeline.
First, collect representative optimisation tasks from actual business workflows: routing exceptions, production schedules, replenishment plans, crew assignments, capacity-allocation cases. Second, build independent feasibility checkers that encode hard constraints. Third, maintain strong baselines, including commercial solvers, existing heuristics, and human-designed workflows. Fourth, score agents on feasibility, objective quality, runtime, and coverage. Fifth, separate development feedback from held-out evaluation. Sixth, allow iterative search only where the economics justify the extra computation.
That last condition matters. Test-time evolution is not free. A 30-attempt budget may be justified for weekly network planning or high-value capacity allocation. It is probably not justified for a low-margin micro-decision repeated thousands of times per hour unless the improvement is very large and stable. Optimisation has always been about trade-offs; adding agents does not repeal arithmetic.
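The arithmetic is simple enough to write down. Every number below is hypothetical and chosen only to illustrate the contrast. The function name is illustrative too, and the model ignores reuse of an evolved algorithm across later decisions.

```python
def net_value_of_search(decision_value: float, expected_improvement: float,
                        attempts: int, cost_per_attempt: float) -> float:
    """Expected gain from iterative search minus its compute cost, per decision."""
    benefit = decision_value * expected_improvement
    cost = attempts * cost_per_attempt
    return benefit - cost


# Hypothetical weekly network plan: large value at stake, 2% expected improvement.
weekly_plan = net_value_of_search(200_000.0, 0.02, attempts=30, cost_per_attempt=20.0)

# Hypothetical micro-decision: tiny value, same relative improvement.
micro_decision = net_value_of_search(5.0, 0.02, attempts=30, cost_per_attempt=0.05)

print(f"weekly plan: {weekly_plan:+.2f}")
print(f"micro-decision: {micro_decision:+.2f}")
```

Under these assumed figures the weekly plan comfortably repays the 30 attempts, while the micro-decision loses money on every run unless the improvement is far larger than assumed.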
The boundary conditions are practical, not ceremonial
FrontierOR has limitations that matter for use.
The tasks come from reproducible OR literature, which is exactly what gives the benchmark discipline. It also means the benchmark may underrepresent domains where data, formulations, or solver implementations are not publicly recoverable. Many enterprise optimisation problems live in that swamp.
The Gurobi reference is a strong and useful baseline, but it is not a claim of universal optimality. In some large instances Gurobi itself does not prove optimality within the one-hour budget; the reference may be the best feasible objective found under the benchmark conditions. That is appropriate for practical comparison, but readers should not confuse “beats Gurobi under this setup” with “solves the mathematical problem absolutely.”
The self-evolution experiment is deliberately narrower than the full benchmark. It uses selected hard tasks, a shared backbone model, and a fixed candidate budget. The result supports the value of feedback-driven search, especially CORAL-style multi-agent recombination, but it does not prove that autonomous agents can safely own all large-scale planning workflows.
Finally, the benchmark evaluates generated programs in a controlled environment: one CPU core, a fixed software stack, disabled network access, and standardised input-output schemas. Production environments add integration risk, data latency, governance rules, exception handling, and the ancient enterprise ritual of discovering that “capacity” means three different things in three systems.
These boundaries do not weaken the paper. They make the result more usable. The benchmark is not claiming to solve enterprise optimisation. It is showing how much harder the evaluation has to be.
The real contribution is a stricter definition of “works”
FrontierOR’s most valuable contribution is not just another leaderboard. It raises the standard for what an AI optimisation agent must prove.
Runnable code is not enough. A correct formulation is not enough. A fast heuristic is not enough. A beautiful objective attached to an infeasible plan is definitely not enough, though it may make a nice dashboard screenshot if nobody checks.
The benchmark asks for the combination that businesses actually need: feasibility, solution quality, and time efficiency on large instances, under hidden evaluation, against a serious baseline. Current models are improving, and self-evolving frameworks show that feedback can help. But the paper’s evidence also makes the uncomfortable point clear: solver-calling is not algorithm design.
For enterprises, the message is refreshingly unsentimental. Use LLMs in optimisation, but test them like optimisation systems. Build the harness. Keep the baselines. Measure coverage. Reward structure-aware algorithms. Treat self-evolution as search under budget, not as intelligence perfume sprayed on a weak solver script.
The future optimisation agent may be genuinely useful. FrontierOR’s contribution is to make it harder for the present one to bluff.
Cognaptus: Automate the Present, Incubate the Future.
[1] Minwei Kong et al., “FrontierOR: Benchmarking LLMs’ Capacity for Efficient Algorithm Design in Large-Scale Optimization,” arXiv:2605.25246v3, 30 May 2026.