Opening — Why this matters now
Everyone wants LLMs to think harder. Enterprises, however, mostly need them to think correctly — especially when optimization models decide real money, real capacity, and real risk. As organizations scale, optimization problems grow beyond toy examples. Data spills into separate tables, constraints multiply, and naïve prompt‑to‑solver pipelines quietly collapse.
The paper behind today’s discussion introduces LEAN‑LLM‑OPT, a system that delivers an unfashionable but effective message: large language models do not fail because they are too small — they fail because we ask them to do too much at once.
Background — From prompting to orchestration
Early attempts at LLM‑driven optimization followed a simple logic: describe the problem, ask the model to generate a formulation, solve it. This works — briefly — for small linear programs where all information fits neatly into the prompt.
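To make the failure mode concrete, here is a minimal sketch of that prompt‑to‑solver pattern, assuming a hypothetical `chat()` LLM call (stubbed with a canned response) and Python's open‑source PuLP modeler; the paper does not prescribe this exact implementation.

```python
# Minimal sketch of the naive prompt-to-solver pattern.
# `chat()` is a hypothetical stand-in for a real LLM API call;
# here it returns a canned response so the sketch runs end to end.
import pulp  # open-source LP modeler: pip install pulp

def chat(prompt: str) -> str:
    """Hypothetical LLM call. A real pipeline would hit an API here."""
    return (
        "model = pulp.LpProblem('toy', pulp.LpMaximize)\n"
        "x = pulp.LpVariable('x', lowBound=0)\n"
        "y = pulp.LpVariable('y', lowBound=0)\n"
        "model += 3*x + 2*y\n"   # objective
        "model += x + y <= 4\n"  # constraints
        "model += x <= 2\n"
        "model.solve(pulp.PULP_CBC_CMD(msg=False))\n"
    )

problem = "Maximize 3x + 2y subject to x + y <= 4, x <= 2, x, y >= 0."
code = chat(f"Write PuLP code solving this LP in a variable `model`: {problem}")

ns = {"pulp": pulp}
exec(code, ns)  # generated code runs unvalidated: the silent failure mode
print(pulp.LpStatus[ns["model"].status], pulp.value(ns["model"].objective))
```

This works precisely because everything the model needs sits in one short prompt; there is no validation step, so any formulation error fails silently.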
Once datasets become external, heterogeneous, and large‑scale, performance degrades sharply. Prior work tried to patch this gap via:
| Approach | Core Limitation |
|---|---|
| Prompt engineering | Fragile and non‑scalable |
| Fine‑tuning | Expensive and domain‑specific |
| Solver‑aware models | High training cost, limited portability |
LEAN‑LLM‑OPT takes a different route: system design over model size.
Analysis — What LEAN‑LLM‑OPT actually does
LEAN‑LLM‑OPT separates optimization modeling into what must be reasoned and what can be standardized.
The agentic workflow
Instead of a single monolithic prompt, the system uses:
- Upstream planner agents — construct a step‑by‑step modeling workflow based on problem type
- Downstream generator agent — follows this workflow to produce the final formulation
- Tooling layer — handles data retrieval, parsing, and bookkeeping
This division offloads mechanical tasks and preserves cognitive bandwidth for constraint logic and coefficient placement — precisely where LLMs struggle most.
Crucially, workflows are interpretable, modular, and reusable. No retraining required.
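A minimal sketch of that planner/generator/tooling split follows. All names (`plan`, `run_tool`, `generate`, the workflow template) are illustrative assumptions; LEAN‑LLM‑OPT's actual agents, prompts, and workflow schema are more elaborate.

```python
# Sketch of the planner -> generator -> tools division of labor.
# Function bodies are placeholders, not LEAN-LLM-OPT's real implementation.
from dataclasses import dataclass

@dataclass
class Step:
    action: str   # e.g. "load_table", "define_variables", "add_constraints"
    detail: str   # natural-language instruction for the generator agent

def plan(problem_type: str) -> list[Step]:
    """Upstream planner: map a problem type to a reusable workflow template."""
    templates = {
        "capacity_allocation": [
            Step("load_table", "Fetch demand, capacity, and fare tables"),
            Step("define_variables", "One allocation variable per fare class"),
            Step("add_constraints", "Demand bounds and shared cabin capacity"),
            Step("set_objective", "Maximize total fare revenue"),
        ],
    }
    return templates[problem_type]

def run_tool(step: Step, data_store: dict) -> dict:
    """Tooling layer: deterministic data retrieval and parsing, no LLM involved."""
    if step.action == "load_table":
        return {k: data_store[k] for k in ("demand", "capacity", "fares")}
    return {}

def generate(step: Step, context: dict) -> str:
    """Downstream generator: one narrow LLM call per step (stubbed here)."""
    return f"# {step.action}: {step.detail} | context keys: {list(context)}"

def build_model(problem_type: str, data_store: dict) -> str:
    context: dict = {}
    fragments = []
    for step in plan(problem_type):
        context.update(run_tool(step, data_store))  # mechanical work offloaded
        fragments.append(generate(step, context))   # reasoning kept narrow
    return "\n".join(fragments)

print(build_model("capacity_allocation", {
    "demand": {"Y": 40, "M": 90}, "capacity": 180, "fares": {"Y": 450, "M": 220},
}))
```

The design point is that the generator never sees the raw data dump or the full problem at once; each LLM call is scoped to a single, pre-planned step.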
Findings — Results that actually matter
The results are blunt.
Large‑Scale‑OR benchmark (execution accuracy)
| Model | Overall Accuracy |
|---|---|
| GPT‑4.1 | 14.85% |
| gpt‑oss‑20B | 25.74% |
| Gemini 3 Pro | 52.48% |
| LEAN‑LLM‑OPT (GPT‑4.1) | 85.15% |
| LEAN‑LLM‑OPT (gpt‑oss‑20B) | 80.20% |
The same base models jump from 15–26% execution accuracy to above 80%: production‑grade reliability with no additional training.
Even more interesting: workflow‑only or tools‑only variants collapse to near zero accuracy. Structure and tooling are complements, not substitutes.
Real‑world validation: Airline revenue management
On Singapore Airlines–style fare capacity allocation problems, LEAN‑LLM‑OPT reaches 93% execution accuracy, while base GPT‑4.1 fails entirely. The model is not discovering optimization theory; it is being guided through it.
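For intuition, here is a toy version of a single‑leg fare‑class allocation LP, with made‑up demand, fare, and capacity numbers; the benchmark instances are far larger and pull their data from external tables rather than the prompt.

```python
# Toy single-leg fare-class allocation LP (illustrative numbers only).
import pulp

fares    = {"Y": 450, "M": 220, "Q": 120}  # revenue per seat by fare class
demand   = {"Y": 40,  "M": 90,  "Q": 200}  # forecast demand per class
capacity = 180                              # seats on the leg

model = pulp.LpProblem("fare_allocation", pulp.LpMaximize)
seats = {
    c: pulp.LpVariable(f"seats_{c}", lowBound=0, upBound=demand[c])
    for c in fares
}

model += pulp.lpSum(fares[c] * seats[c] for c in fares)   # maximize revenue
model += pulp.lpSum(seats[c] for c in fares) <= capacity  # shared cabin

model.solve(pulp.PULP_CBC_CMD(msg=False))
for c in fares:
    print(c, seats[c].value())
print("revenue:", pulp.value(model.objective))
```

Nothing here is conceptually hard; the difficulty at benchmark scale is placing thousands of coefficients correctly, which is exactly the bookkeeping the workflow takes off the model's plate.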
Implications — Why this should change how you build AI systems
Three uncomfortable takeaways for AI teams:
- **Model upgrades are not strategy.** Older and smaller models, when properly orchestrated, outperform newer giants used naïvely.
- **Workflow design is now an AI primitive.** Operations research quietly becomes a design manual for agentic AI.
- **Cost, privacy, and control improve together.** Open models + structured workflows reduce both inference cost and governance risk.
This aligns naturally with emerging ideas of assured autonomy — systems that are verifiable, constrained, and auditable by design.
Conclusion — The unglamorous future of useful AI
LEAN‑LLM‑OPT does not make LLMs smarter. It makes them behave. In enterprise optimization, that distinction matters far more than parameter counts or leaderboard wins.
The future of applied AI will look less like a larger brain and more like a better organization chart.
Cognaptus: Automate the Present, Incubate the Future.