Opening — Why this matters now
Everyone wants AI that can “just figure it out.”
Describe a supply chain problem, a scheduling constraint, or a pricing objective—and expect the system to generate a mathematically sound optimization model. That’s the dream. And increasingly, it’s the pitch behind AI copilots in enterprise decision-making.
The paper quietly dismantles that assumption.
It shows that while large language models are fluent in language, they are still clumsy in formal reasoning systems—especially when precision, structure, and logical consistency are non-negotiable.
This is not a small gap. It is the difference between “interesting demo” and “deployable system.”
Background — The uncomfortable bottleneck in optimization
Optimization has always followed a deceptively simple pipeline:
| Step | Description | Who does it |
|---|---|---|
| 1 | Describe the problem | Domain expert |
| 2 | Translate into formal model | Modeling expert |
| 3 | Solve with algorithm | Solver |
The friction sits squarely in Step 2.
Even with high-level languages like MiniZinc, the translation from natural language → mathematical structure remains deeply manual. Domain experts understand the problem. Modelers understand the formalism. Rarely are they the same person.
This creates three systemic issues:
- Operational bottlenecks (modeling expertise is scarce)
- Translation errors (requirements get distorted)
- Low iteration speed (every change requires re-modeling)
LLMs appear to offer a shortcut. They don’t.
Analysis — What the paper actually builds
Instead of assuming LLMs can directly solve the problem, the authors design something more pragmatic: Modeling Co-Pilots.
1. Formalizing the problem
The paper defines text-to-model translation as a function:
- Input: natural language description + parameters + metadata
- Output: a valid optimization model (MiniZinc)
Conceptually simple. Practically brutal.
Why? Because the model must simultaneously:
- Infer variables and constraints
- Maintain type consistency
- Respect logical dependencies
- Produce executable code
This is not “text generation.” It is structured synthesis under constraints.
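The input/output contract above can be sketched as a typed function. This is a minimal illustration, not the paper's interface: `ProblemSpec` and `call_llm` are hypothetical names, and the stubbed LLM simply returns a fixed MiniZinc string.

```python
# Hypothetical sketch of the text-to-model task as a typed function.
# `call_llm` stands in for any LLM API; all names are illustrative.
from dataclasses import dataclass

@dataclass
class ProblemSpec:
    description: str   # natural language problem statement
    parameters: dict   # instance data (e.g. capacities, costs)
    metadata: dict     # tags such as problem class or objective sense

def text_to_model(spec: ProblemSpec, call_llm) -> str:
    """Translate a natural-language spec into a candidate MiniZinc model."""
    prompt = (
        "Translate this problem into a valid MiniZinc model.\n"
        f"Description: {spec.description}\n"
        f"Parameters: {spec.parameters}\n"
        f"Metadata: {spec.metadata}\n"
    )
    return call_llm(prompt)  # returns MiniZinc source as a string

# Usage with a stubbed LLM:
stub = lambda prompt: "var 0..10: x;\nconstraint x >= 3;\nsolve minimize x;"
model = text_to_model(ProblemSpec("minimize x subject to x >= 3", {}, {}), stub)
```

Everything hard, type consistency, logical dependencies, executable output, hides inside that single `call_llm` step.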
2. The strategy spectrum
The authors test multiple approaches—ranging from naive prompting to agentic decomposition.
| Strategy Type | Approach | Key Idea | Weakness |
|---|---|---|---|
| Single-call | Zero-shot | Direct translation | High failure rate |
| Single-call | Chain-of-Thought | Structured reasoning | Still brittle |
| Multi-call | Knowledge Graph | Intermediate structure | Noisy abstraction |
| Multi-call | Validation loops | Iterative correction | Cost + latency |
| Agentic | Decomposition | Modular generation | Integration errors |
A subtle but important takeaway: more structure helps—but only up to a point.
Beyond that, complexity compounds faster than accuracy improves.
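The validation-loop row of the table can be sketched as follows. This is a toy under stated assumptions: `generate` and `compile_model` are placeholder callables (a real system would invoke an LLM and the MiniZinc compiler), not the paper's actual pipeline.

```python
# Minimal sketch of a validation loop: generate a candidate model, try to
# compile it, and feed any error message back into the next generation.
# `generate` and `compile_model` are stand-ins, not the paper's API.

def validation_loop(generate, compile_model, max_rounds=3):
    """Iteratively repair a candidate model using compiler feedback."""
    feedback = ""
    for _ in range(max_rounds):
        model = generate(feedback)        # LLM call, conditioned on errors
        ok, error = compile_model(model)  # e.g. shell out to `minizinc`
        if ok:
            return model                  # first model that compiles
        feedback = f"Previous attempt failed: {error}"
    return None                           # give up: cost + latency budget spent

# Stubbed demo: the first draft has a typo, the second compiles.
drafts = iter(["consraint x > 1;", "constraint x > 1;"])
gen = lambda fb: next(drafts)
comp = lambda m: (m.startswith("constraint"), "syntax error")
result = validation_loop(gen, comp)
```

The cost/latency weakness in the table is visible in the structure: every repair round is another model call.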
3. The dataset that quietly matters
The second contribution, Text2Zinc, is less flashy but more important.
It standardizes:
- Natural language descriptions
- Formal models (MiniZinc)
- Input data (.dzn)
- Verified outputs
This enables something the field previously lacked:
A controlled benchmark for evaluating whether AI actually understands optimization problems.
Without this, most “AI + OR” claims are, frankly, anecdotal.
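The shape of a benchmark record can be illustrated with a small sketch. The field names here are guesses based on the four components listed above, not Text2Zinc's actual schema, and the comparison logic is deliberately naive.

```python
# Illustrative shape of a Text2Zinc-style record; field names are guesses
# based on the components listed above, not the dataset's real schema.
from dataclasses import dataclass

@dataclass
class BenchmarkInstance:
    description: str  # natural language problem statement
    model: str        # reference MiniZinc model
    data: str         # instance data in .dzn format
    expected: str     # verified solver output

def solution_matches(instance: BenchmarkInstance, produced_output: str) -> bool:
    """Compare a candidate's solver output against the verified answer."""
    return produced_output.strip() == instance.expected.strip()
```

Pairing each description with verified outputs is what turns anecdote into measurement: a generated model can be executed on the `.dzn` data and checked mechanically.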
Findings — Where the illusion breaks
The results are… sobering.
1. Accuracy is nowhere near production-ready
From the benchmark results (Tables on pages 12–14):
| Metric | Best-performing strategy | Result |
|---|---|---|
| Execution Accuracy | CoT + Grammar | ~96% |
| Solution Accuracy | Best overall | ~55–85% (varies by dataset) |
The key gap:
Models can run, but they often produce the wrong solution.
This is worse than failure—it is plausible correctness.
2. Grammar matters more than intelligence
One of the most interesting findings:
- Grammar-constrained validation consistently improves results
- Often outperforming more “intelligent” agentic approaches
Interpretation:
The gains come less from the LLM reasoning better and more from the grammar mechanically preventing it from making syntactic mistakes.
Not exactly the romantic vision of AI reasoning.
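The idea of a grammar gate can be shown with a deliberately crude stand-in. A real system would validate against the full MiniZinc grammar; this regex check is only a toy illustration of rejecting output before it ever reaches the solver.

```python
# Toy illustration of grammar-constrained validation: reject any output that
# fails a cheap syntactic check. A real system would use the MiniZinc
# grammar; this regex gate is only a stand-in.
import re

# Crude rule: each statement starts with a known keyword and ends with ';',
# and the model must contain a solve item.
STATEMENT = re.compile(r"^\s*(var|constraint|solve|int|array|include)\b.*;\s*$")

def passes_grammar_gate(model: str) -> bool:
    lines = [l for l in model.splitlines() if l.strip()]
    return (
        bool(lines)
        and all(STATEMENT.match(l) for l in lines)
        and any(l.strip().startswith("solve") for l in lines)
    )
```

The gate knows nothing about the problem's meaning, yet filtering on form alone is what moved the numbers.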
3. Agentic systems are not a silver bullet
Breaking the task into sub-agents (variables, constraints, objective) sounds elegant.
In practice:
- Coordination errors increase
- Integration becomes fragile
- Gains are inconsistent
Translation: decomposition introduces its own failure modes.
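The decomposition idea, and where it leaks, can be sketched in a few lines. The sub-agents here are plain stubs; none of the names are the paper's interfaces.

```python
# Sketch of agentic decomposition: separate "agents" (here plain stubs)
# draft the variables, constraints, and objective, and a final step
# stitches them together. All names are illustrative.

def agentic_generate(spec, agents):
    """Run each sub-agent on the spec and naively concatenate the pieces."""
    parts = [agent(spec) for agent in agents]
    # Naive integration: nothing checks that the pieces agree on variable
    # names or types -- exactly where integration errors creep in.
    return "\n".join(parts)

# Stubbed sub-agents for a toy minimization problem:
var_agent = lambda s: "var 0..10: x;"
con_agent = lambda s: "constraint x >= 3;"
obj_agent = lambda s: "solve minimize x;"

model = agentic_generate(
    "minimize x with x >= 3", [var_agent, con_agent, obj_agent]
)
```

If the constraint agent had emitted `y >= 3` instead of `x >= 3`, each piece would still look locally valid while the assembled model fails, a coordination error no single agent can see.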
4. The real problem is semantic alignment
The paper’s error analysis (page 34) highlights recurring failure types:
| Error Category | What it reveals |
|---|---|
| Syntax errors | Weak language grounding in MiniZinc |
| Undefined variables | Incomplete reasoning chains |
| Constraint mismatch | Misinterpretation of problem intent |
| Solver limitations | Lack of system-level awareness |
This is not just a tooling issue.
It is a representation problem—LLMs do not naturally think in constraint systems.
Implications — What this means for AI products
1. Copilots, not autopilots
The paper makes it clear:
Text-to-model is not yet “one-click automation.”
The realistic architecture is:
- AI generates candidate models
- Humans validate or guide
- Iterative refinement loops close the gap
In other words: decision augmentation, not replacement.
2. Structure beats scale
Throwing a bigger model at the problem helps—but only marginally.
What actually moves performance:
- Intermediate representations (graphs, schemas)
- Formal validation layers
- Domain-specific constraints
This aligns with a broader pattern in enterprise AI:
ROI comes from system design, not just model capability.
3. Dataset design is now a competitive moat
Text2Zinc is not just a dataset. It is infrastructure.
Companies building serious AI copilots will need:
- Domain-specific corpora
- Structured input-output mappings
- Continuous validation pipelines
Generic LLM APIs won’t get you there.
4. Beware “plausible automation”
Perhaps the most dangerous outcome is not failure—but quietly incorrect success.
A model that compiles but encodes the wrong constraints:
- Produces valid-looking outputs
- Passes superficial checks
- Fails in real-world decisions
This is where governance, auditing, and verification become critical.
Conclusion — The bridge is still under construction
The promise of AI translating human intent into formal decision systems is real.
But this paper shows the current reality with unusual honesty:
- LLMs understand language
- Optimization requires structure
- The gap between them is still wide
Modeling copilots are not magic.
They are scaffolding—useful, evolving, and occasionally unreliable.
And like all scaffolding, they only matter if someone knows how to build with them.
Cognaptus: Automate the Present, Incubate the Future.