TL;DR for operators

Real business problems do not arrive as tidy exam questions. They arrive as “Can we optimise this logistics network?”, “Which markets should we prioritise?”, “How many clinics do we need?”, or “What happens if the subsidy disappears?” The annoying part is not the equation. The annoying part is deciding what the equation should even represent.

That is the useful frame for ModelingAgent, a paper that treats mathematical modelling as a test of practical intelligence rather than schoolroom calculation.[1] The authors build ModelingBench, a benchmark of 68 open-ended real-world modelling tasks from COMAP-style contests, then test whether LLM systems can produce complete modelling reports using tools, data, code, assumptions, analysis, and written justification.

The main lesson is satisfyingly unfashionable: giving an LLM tools is not the same as giving it a workflow. Tool access improves data groundedness, but only modestly. The larger gains come from splitting the job into specialised roles—idea generation, data search, model implementation, report writing—and forcing those roles through critic-guided refinement.

For operators, this is not a “replace analysts” paper. It is closer to a blueprint for building AI-supported decision teams: one agent frames options, one gathers evidence, one implements models, one writes the memo, and a critic challenges the work before a human signs off. Boring governance, unfortunately, remains undefeated.

The boundary is equally important. The benchmark is small, curated, and partly constrained by what text-first LLMs can handle. Automated judging is useful but not neutral. Visual reasoning is not deeply tested. High-stakes decisions still need domain experts, validation, and accountability. Anyone reading this as “AI can now solve messy strategy problems end-to-end” has, as usual, sprinted past the evidence and into the gift shop.

The hard part is not calculation; it is formulation

Most LLM evaluation still rewards the ability to answer well-posed questions. Solve the algebra. Pick the correct option. Prove the lemma. Write the code. This has value, but it misses the ugliest part of real-world problem solving: problem formulation.

A mathematical modelling task begins before computation. It asks the solver to convert a situation into a tractable abstraction. Which variables matter? Which constraints can be ignored without committing analytical vandalism? Which data sources are credible? Which objective should be optimised? What assumptions must be made explicit? How should uncertainty be handled? And finally, can the result be explained in a way a decision-maker can use?

This is why the paper’s choice of benchmark matters. ModelingBench is built from international mathematical modelling competitions: MCM, ICM, HiMCM, MidMCM, and IM2C. These are not single-answer puzzles. They are report-based modelling challenges across domains such as public health, emergency services, environmental engineering, sports analytics, finance, ecosystems, and operations management.

The benchmark contains 68 curated problems, spanning 2000 to 2025, with an average of 7.31 subtasks per problem and more than 70 domains represented. The difficulty split is 6 easy, 38 medium, and 24 hard problems. That distribution matters because the authors are not merely asking whether a model can calculate. They are asking whether it can sustain a multi-stage modelling workflow across ambiguous problem statements.

The old benchmark habit is to measure whether a model reaches an answer. ModelingBench instead asks whether the model can produce a defensible modelling report. That is a higher bar, and a much less comfortable one.

ModelingBench is designed to make ambiguity measurable

The benchmark construction has three important design decisions.

First, the authors start from real contest problems but filter them for feasibility. They use GPT-4o to rate candidate problems on data accessibility, modelling difficulty, and image clarity, then manually check and refine the set. Problems requiring physical measurement, inaccessible data, or difficult visual interpretation are excluded or simplified.

That filtering is a strength and a limitation at the same time. It improves benchmark quality and makes the tasks usable for current LLM systems. It also means the benchmark avoids some of the nastiest real-world modelling conditions: messy maps, ambiguous diagrams, physical measurement, proprietary data, sensor noise, and institutional constraints. Reality, being inconsiderate, contains all of those.

Second, ModelingBench gives systems a sandbox of tools. These include file reading and writing, web search, URL extraction, web download, image captioning, OCR, PDF parsing, Python execution, and a general solution generator. This is deliberately closer to how human modelling teams work. They search for data, inspect documents, run code, produce intermediate files, and assemble a report.

Third, the benchmark accepts that modelling has multiple valid solutions. This is crucial. In enterprise settings, two modelling teams can produce different but defensible approaches: one may use simulation, another optimisation, another causal analysis, another scenario planning. The question is not always “which answer is correct?” It is often “which assumptions, evidence, and method are most defensible for the decision at hand?”

That is why the paper needs more than a benchmark. It also needs an architecture and a judge.

ModelingAgent divides labour before it divides the problem

The paper’s central system, ModelingAgent, is a four-role multi-agent framework coordinated by shared memory and critic feedback.

| Agent role | Technical responsibility | Operational analogue |
| --- | --- | --- |
| Idea Proposer | Decomposes the problem, proposes modelling approaches, abstracts subtasks, and refines ideas | Strategy analyst framing possible approaches |
| Data Searcher | Identifies needed variables, searches for real-world data, and organises evidence | Research analyst or data acquisition lead |
| Model Implementor | Converts ideas into mathematical formulations, writes code, and analyses results | Quantitative modeller or data scientist |
| Report Writer | Synthesises the workflow into a coherent final report | Consulting-style memo writer |
| Critic Module | Scores and critiques intermediate outputs against role-specific rubrics | Senior reviewer, mentor, or red-team lead |
| Shared Memory | Stores intermediate artefacts, trajectories, decisions, and agent outputs | Project workspace or analytical audit trail |

The important design move is not simply “multiple agents talk to each other.” That phrase has become the conference-paper equivalent of adding coriander to everything. The useful move is role-specific accountability.

Each agent has a bounded job. The Idea Proposer is not also expected to fetch data, implement code, and write the final report. The Data Searcher is judged on data quality and reliability, not prose. The Model Implementor is judged on mathematical formulation and computational execution. The Report Writer must retrieve from shared memory and create a coherent document.
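
To make the idea of bounded roles writing to a shared workspace concrete, here is a minimal sketch. The role names come from the paper as described above; everything else (the `Artefact` and `SharedMemory` classes, the field names, the example file name `clinic_visits.csv`) is a hypothetical layout chosen for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Role names follow the article; the data layout is an illustrative assumption.
ROLES = ("idea_proposer", "data_searcher", "model_implementor", "report_writer")


@dataclass
class Artefact:
    role: str                      # which agent produced this artefact
    kind: str                      # e.g. "idea", "dataset", "code", "report"
    content: str
    score: Optional[float] = None  # filled in by a critic, if one has reviewed it


@dataclass
class SharedMemory:
    artefacts: List[Artefact] = field(default_factory=list)

    def write(self, artefact: Artefact) -> None:
        # Reject writes from anything that is not one of the four named roles.
        if artefact.role not in ROLES:
            raise ValueError(f"unknown role: {artefact.role}")
        self.artefacts.append(artefact)

    def read(self, kind: str) -> List[Artefact]:
        # Return every stored artefact of the requested kind, in write order.
        return [a for a in self.artefacts if a.kind == kind]


memory = SharedMemory()
memory.write(Artefact("idea_proposer", "idea", "Monte Carlo risk assessment"))
memory.write(Artefact("data_searcher", "dataset", "clinic_visits.csv"))  # hypothetical file

# A report writer would pull what it needs from memory rather than regenerate it.
report_inputs = memory.read("idea") + memory.read("dataset")
```

The point of the sketch is the audit trail: each artefact records who produced it, so a failure can be traced to a role rather than to "the AI".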

This matters because open-ended modelling tasks fail in different ways. A system can have a clever idea but poor data. It can gather data but use the wrong model. It can implement the model but write an incoherent report. It can produce beautiful prose over invalid assumptions, which is the traditional consulting hazard and now, apparently, a software feature.

ModelingAgent’s design makes these failure modes more separable. That is the point.

The critic module is the engine, not a decorative reviewer

The critic module is not a final grading layer added after the system is done. It is embedded into the workflow. For each agent target, the critic scores candidate outputs using rubrics, gives feedback, discards weaker candidates, refines stronger ones, and asks the agent to explore replacements.

The algorithm is simple in spirit:

  1. Generate multiple candidate solutions.
  2. Score each against role-specific rubrics.
  3. Keep the stronger candidates.
  4. Refine them using feedback.
  5. Replace discarded candidates with new explored alternatives.
  6. Repeat for a fixed number of iterations.
  7. Select the highest-scoring candidate for the next stage.
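
The seven steps above can be sketched as a short loop. This is a toy under stated assumptions: `critic_guided_search`, its parameters, and the keyword-counting scorer are all hypothetical stand-ins. In the paper the proposer, critic, and refiner are LLM calls with role-specific rubrics, and the actual candidate counts and iteration limits are not given here.

```python
from itertools import cycle
from typing import Callable, Tuple


def critic_guided_search(
    propose: Callable[[], str],
    score: Callable[[str], float],
    refine: Callable[[str, float], str],
    n_candidates: int = 4,   # assumed value, not from the paper
    keep: int = 2,           # assumed value, not from the paper
    iterations: int = 3,     # assumed value, not from the paper
) -> Tuple[str, float]:
    candidates = [propose() for _ in range(n_candidates)]        # step 1
    for _ in range(iterations):                                  # step 6
        ranked = sorted(candidates, key=score, reverse=True)     # step 2
        survivors = ranked[:keep]                                # step 3
        refined = [refine(c, score(c)) for c in survivors]       # step 4
        fresh = [propose() for _ in range(n_candidates - keep)]  # step 5
        candidates = refined + fresh
    best = max(candidates, key=score)                            # step 7
    return best, score(best)


# Toy stand-ins for the LLM-backed agent and critic.
KEYWORDS = ("monte carlo", "sensitivity", "simulation")
_pool = cycle(["standard risk assessment", "expert survey", "simulation study"])


def toy_propose() -> str:
    return next(_pool)


def toy_score(idea: str) -> float:
    # Rubric stand-in: count how many quantitative keywords the idea mentions.
    return float(sum(k in idea.lower() for k in KEYWORDS))


def toy_refine(idea: str, current_score: float) -> str:
    # Feedback stand-in: add the first missing keyword, if any.
    for k in KEYWORDS:
        if k not in idea.lower():
            return f"{idea} + {k}"
    return idea


best, best_score = critic_guided_search(toy_propose, toy_score, toy_refine)
```

One design caveat: this sketch trusts refinement to help. A real critic loop would probably keep the original candidate whenever the refined version scores lower.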

The role-specific rubrics are practical. For idea proposal, the critic evaluates relevance, mathematical rigor, and practical feasibility. For mathematical formulation, it checks comprehensiveness, rigor, and executability. For data search, it checks data quality, reliability, and file structure. For implementation and analysis, it checks approach, code quality, report quality, and whether the work contains justified assumptions and sensitivity analysis.

This is one of the paper’s better operational insights. Many agent systems treat self-critique as a vague request: “improve your answer.” ModelingAgent turns critique into a more explicit optimisation loop. The critic is not just nagging the system. It is selecting, pruning, and steering.

The paper’s case study illustrates the mechanism. In one example, a GPT-4o-based ModelingAgent initially proposes standard risk assessment for a feasibility subtask. The critic finds it relevant but insufficiently quantitative. In the next round, the approach becomes a Monte Carlo simulation-based risk assessment, improving the method’s analytical depth and critic score.

That does not prove the system has deep causal understanding. It does show something useful: structured feedback can move the output from generic business-language modelling toward a more operational quantitative method. A small mercy, but a mercy nonetheless.

ModelingJudge evaluates reports, not isolated answers

Open-ended modelling cannot be graded like multiple-choice arithmetic. The authors therefore introduce ModelingJudge, an LLM-based evaluation framework inspired by modelling competition judging.

ModelingJudge evaluates the final report along three major dimensions:

| Evaluation dimension | What it measures | Why it matters |
| --- | --- | --- |
| Structural coherency | Whether the report has clear sections such as assumptions, formulation, solution process, and analysis | A correct fragment is not a usable decision document |
| Solution completeness | Whether the report addresses all subtasks and requirements | Real modelling fails when ignored requirements hide in plain sight |
| Solution quality | Groundedness of modelling, data, analysis, plus innovativeness | The method must fit the situation, evidence, and decision context |

For solution quality, the judge uses multiple expert roles: a mathematical modelling expert, a data analysis expert, and two domain-specific experts selected for the problem. This is sensible because a model for forestry, opioid use, or ambulance placement should not be evaluated only by generic mathematical neatness. Domain relevance matters. The world continues to be annoyingly domain-specific.
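
A rough sketch of how multi-expert scoring might be combined follows. The four expert roles and the four quality criteria come from the description above. The aggregation rule (a plain mean per criterion), the function name `aggregate_quality`, and every number in `toy_scores` are assumptions for illustration only; the article does not state how the paper combines expert scores, and the figures are not results.

```python
from typing import Dict

CRITERIA = ("modelling", "data", "analysis", "innovativeness")


def aggregate_quality(expert_scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average each criterion across experts (assumed rule, not the paper's)."""
    n_experts = len(expert_scores)
    return {
        criterion: sum(scores[criterion] for scores in expert_scores.values()) / n_experts
        for criterion in CRITERIA
    }


# Placeholder scores on a 0-100 scale, invented purely to exercise the function.
toy_scores = {
    "modelling_expert": {"modelling": 80, "data": 60, "analysis": 70, "innovativeness": 40},
    "data_expert":      {"modelling": 70, "data": 80, "analysis": 60, "innovativeness": 40},
    "domain_expert_1":  {"modelling": 60, "data": 60, "analysis": 60, "innovativeness": 60},
    "domain_expert_2":  {"modelling": 70, "data": 80, "analysis": 50, "innovativeness": 60},
}

quality = aggregate_quality(toy_scores)
```

Keeping the per-expert scores alongside the mean is worth the storage: a report that the data expert loves and the domain experts dislike is a different animal from one that everyone finds mediocre.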

Still, this is also where interpretation needs discipline. ModelingJudge is itself LLM-based. The authors use GPT-4o consistently across experiments to keep comparisons stable, but consistency is not the same as objectivity. Automated judges can reflect prompt bias, model preference, verbosity preference, or hidden scoring artefacts. The paper recognises this and supplements automated judging with human evaluation, but the main quantitative table still depends on the judge framework.

So, ModelingJudge is best read as a scalable evaluation instrument, not as an oracle. The oracle industry remains oversupplied.

The main results show that tools help, but structure helps more

The experimental setup compares three system configurations, with top human reports as a reference point:

| Setting | Likely purpose in the paper | What it tests |
| --- | --- | --- |
| Vanilla generation | Baseline | What the model can do from the prompt alone, without tools |
| Base Tool Agent | Ablation / strong baseline | Whether tool access and planning are enough without structured role division |
| ModelingAgent | Main method | Whether role-specialised agents plus critic refinement improve modelling reports |
| Top human report | Upper-bound comparison | How model outputs compare with award-winning human modelling reports |

The results are the heart of the paper’s misconception correction. The lazy reading is: “Agents need tools.” The better reading is: “Tools reduce one bottleneck, but workflow design reduces several.”

In the reported results, vanilla generation often produces structurally decent but weakly grounded reports. Tool agents improve several groundedness metrics, especially data groundedness, but the average score gains are uneven. For GPT-4o, the average score rises from 50.76 under vanilla generation to 55.35 as a tool agent. For Deepseek-Chat, it rises from 56.14 to 63.57. For QwQ-32B, it rises from 49.13 to 64.97. That is meaningful, but it is not magic. A model with tools can still wander, over-search, under-integrate, or produce a pile of partially connected artefacts.

ModelingAgent produces the stronger shift. GPT-4o reaches an average score of 73.08, Deepseek-Chat reaches 82.42, Qwen2.5-72B-Instruct reaches 78.05, and QwQ-32B reaches 78.92. The largest reported jump from tool-agent baseline to ModelingAgent appears for Llama3.1-70B-Instruct, rising from 42.88 to 70.22, a 27.34-point improvement.
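
The tools-versus-structure comparison is simple arithmetic, so it is worth doing explicitly. The snippet below uses only the average scores quoted in this article, for the three models where all three settings are given; the variable names are mine.

```python
# Average scores as quoted in this article (vanilla, tool agent, ModelingAgent).
reported = {
    "GPT-4o":        {"vanilla": 50.76, "tool_agent": 55.35, "modeling_agent": 73.08},
    "Deepseek-Chat": {"vanilla": 56.14, "tool_agent": 63.57, "modeling_agent": 82.42},
    "QwQ-32B":       {"vanilla": 49.13, "tool_agent": 64.97, "modeling_agent": 78.92},
}

gains = {}
for model, s in reported.items():
    tool_gain = round(s["tool_agent"] - s["vanilla"], 2)             # adding tools
    workflow_gain = round(s["modeling_agent"] - s["tool_agent"], 2)  # adding roles + critic
    gains[model] = (tool_gain, workflow_gain)
    print(f"{model:14s} tools: +{tool_gain:.2f}   workflow: +{workflow_gain:.2f}")
```

On these figures the tool gains are 4.59, 7.43, and 15.84 points, and the workflow gains are 17.73, 18.85, and 13.95 points. For GPT-4o and Deepseek-Chat, structure adds far more than tools did. QwQ-32B is the exception: its tool gain is slightly larger than its workflow gain, which is a fair reminder that the pattern is a tendency, not a law.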

The pattern is more important than any single model. ModelingAgent improves not only data grounding but also modelling groundedness, analysis groundedness, completeness, and innovativeness. This supports the paper’s mechanism-first claim: the framework works because it decomposes the modelling workflow, not because it bolts a search tool onto a chatbot and hopes for civilisation.

A few details deserve care. Gemini-2.0-Think behaves oddly: its structural coherence drops sharply under ModelingAgent compared with vanilla generation, although its analysis and innovativeness improve. This is a useful reminder that multi-agent workflows can disturb otherwise strong single-model behaviours. Orchestration is not free. Coordination can create failure modes of its own.

The award-winning human report still posts the highest overall average at 84.63, with especially strong structural coherence (96.30) and solution completeness (98.83). That gap is telling. LLM agents are improving on the machinery of modelling, but humans still appear better at satisfying the full surface area of a complex prompt and packaging the answer coherently.

Human evaluation cuts against both complacency and panic

The human evaluation adds a useful but limited second view.

The authors recruit 12 volunteer evaluators, 60% with national or international mathematical modelling competition experience, and ask them to rank modelling implementations in arena-style comparisons. The evaluation includes model ranking, method ranking, and a Turing-test-like setup where participants identify which solution seems most likely written by humans.

The results are interesting. Under the ModelingAgent framework, QwQ-32B is most often ranked first by human evaluators. In method comparisons using GPT-4o outputs, ModelingAgent is ranked first 45.83% of the time, while the human expert solution is ranked first 41.67% of the time. The base tool agent and vanilla generation lag behind.

The paper also reports that model-generated solutions are indistinguishable from top human solutions over 50% of the time in the Turing-test setting.

This sounds dramatic, but it needs careful handling. A short human study with 12 volunteers is not a deployment guarantee. The evaluators ranked selected outputs, not live decisions with accountability, external validation, cost constraints, stakeholder conflict, and legal exposure. The study shows that ModelingAgent can produce modelling write-ups that are persuasive and often preferred in controlled comparisons. It does not show that the outputs are safe to use unreviewed in public health, finance, infrastructure, or resource allocation.

For business readers, that distinction is everything. A report that looks human-quality may still contain a brittle assumption, weak data source, or unvalidated extrapolation. The more polished the report, the more dangerous the hidden flaw. Elegance is not evidence. It is sometimes just better typography for being wrong.

The real business pattern is structured decision support

The paper’s business relevance is not that enterprises should deploy autonomous modelling agents tomorrow morning and call it transformation. Please do not. The more useful takeaway is a design pattern for AI-assisted analytical workflows.

A messy business question can be routed through specialised stages:

| Business stage | Agent analogue | Human control point |
| --- | --- | --- |
| Frame the decision | Idea Proposer decomposes the problem and suggests modelling approaches | Executive or domain owner confirms the objective and constraints |
| Gather evidence | Data Searcher finds and documents relevant data | Analyst validates source credibility and coverage |
| Build the model | Model Implementor formalises assumptions and runs analysis | Quantitative expert reviews method and implementation |
| Write the decision memo | Report Writer synthesises assumptions, evidence, results, and recommendations | Decision owner checks clarity, risk, and actionability |
| Challenge the output | Critic scores gaps and proposes refinements | Senior reviewer or red team approves, rejects, or escalates |
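
One way to read the staged workflow above is as a pipeline that cannot advance past a stage until a named human signs off. The sketch below is an illustration of that reading, not something the paper specifies: the `Stage` class, `run_pipeline`, and the `approve` callback are hypothetical, and the agent work itself is reduced to a log line.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    agent: str
    human_checkpoint: str


# Stage, agent, and reviewer labels are taken from the stages described above.
STAGES = [
    Stage("Frame the decision", "Idea Proposer", "Executive or domain owner"),
    Stage("Gather evidence", "Data Searcher", "Analyst"),
    Stage("Build the model", "Model Implementor", "Quantitative expert"),
    Stage("Write the decision memo", "Report Writer", "Decision owner"),
    Stage("Challenge the output", "Critic", "Senior reviewer or red team"),
]


def run_pipeline(stages: List[Stage], approve: Callable[[Stage], bool]) -> List[str]:
    """Run stages in order, stopping at the first human rejection."""
    log = []
    for stage in stages:
        log.append(f"{stage.agent} completed: {stage.name}")
        if not approve(stage):
            log.append(f"{stage.human_checkpoint} rejected: {stage.name}; escalating")
            break
        log.append(f"{stage.human_checkpoint} approved: {stage.name}")
    return log


# Example: the quantitative expert rejects the model, so no memo gets written.
log = run_pipeline(STAGES, approve=lambda s: s.name != "Build the model")
```

The useful property is the early stop. A rejected model never reaches the Report Writer, so nobody ends up holding a well-formatted memo about an analysis that failed review.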

This is applicable to many enterprise domains:

  • demand forecasting with uncertain market data;
  • location planning for hotels, clinics, warehouses, or retail;
  • pricing and margin simulations;
  • supply chain risk assessment;
  • policy impact modelling;
  • energy use optimisation;
  • capital allocation scenarios;
  • operational staffing models;
  • customer segmentation and product prioritisation.

The ROI is not “the AI solves the whole problem.” The ROI is narrower and more believable: faster first-pass modelling, cheaper exploration of alternatives, better documentation of assumptions, more consistent review rubrics, and improved analyst leverage.

In other words, ModelingAgent points toward AI as an analytical scaffold. It can help teams move from a vague question to a structured modelling memo. It can propose candidate methods and gather evidence. It can create a draft implementation. It can expose what needs review. That is useful enough. We do not need to pretend it is a synthetic McKinsey team living in a browser tab.

What the paper directly shows, and what Cognaptus infers

It is worth separating the evidence from the extrapolation.

| Claim | Evidence in the paper | Business meaning | Boundary |
| --- | --- | --- | --- |
| Real-world modelling requires more than mathematical reasoning | ModelingBench tasks require tool use, data, modelling, assumptions, and reports | Enterprise AI evaluation should include workflow quality, not just answer accuracy | Benchmark is curated and contest-derived |
| Tool access improves groundedness | Tool-agent baselines often improve data and analysis metrics over vanilla generation | Search, files, code, and document tools are necessary for practical work | Tools alone do not guarantee coherent synthesis |
| Role-specialised agents improve modelling performance | ModelingAgent improves average scores across most tested models and metrics | Divide analytical labour into scoped agent roles rather than one general assistant | Coordination can introduce new failures |
| Critic-guided refinement helps outputs improve over rounds | Critic trend analysis shows upward scoring across refinement rounds | Review loops should be built into agent systems, not added after deployment | Critic scores may contain bias |
| Human-quality appearance is achievable in some settings | Human evaluators often prefer ModelingAgent outputs and sometimes fail to distinguish them from human reports | AI-generated modelling reports may become persuasive enough for operational use | Persuasiveness is not the same as correctness |
| Human experts still lead on completeness and structure | Award-winning human reports score higher overall, especially on coherence and completeness | Final accountability should stay with expert reviewers | Human report comparison is an upper-bound reference, not a complete audit |

This table is the sober version of the article. Less glamorous, more useful. A rare trade.

The limitations are not footnotes; they shape deployment

The paper’s limitations are not decorative. They materially affect how the work should be used.

First, visual reasoning is under-tested. The authors acknowledge that many real modelling problems require interpreting maps, charts, visual layouts, or complex diagrams. ModelingBench uses tools such as image captioning and OCR as partial support, but this is not native visual reasoning. During curation, problems requiring physical simulation or difficult visual understanding were excluded. For industries such as construction, logistics, agriculture, urban planning, and insurance, that is a major boundary.

Second, the benchmark is small and curated. Sixty-eight problems are enough to make a serious research contribution, but not enough to cover the range of enterprise modelling. The benchmark depends on available contest problems and careful manual filtering. This makes the dataset cleaner, but also less representative of the fully chaotic business world, where data is not merely “hard to find” but often inconsistent, private, political, or strategically distorted.

Third, automated judging remains fragile. ModelingJudge is clever, but LLM judges can favour polished prose, familiar structures, or model-like reasoning patterns. The authors try to reduce this with multiple expert roles and human evaluation, but the issue does not disappear. Any enterprise version needs calibration against human experts, historical decisions, real outcomes, and failure cases.

Fourth, innovation remains weak. Even after ModelingAgent improves innovativeness scores, the paper notes that innovativeness is consistently among the lowest-scoring dimensions. This matters because many real modelling problems are not solved by selecting the nearest textbook method. They require reframing, domain intuition, and sometimes the courage to say the available model is the wrong model. LLM agents are still better at recombining known patterns than discovering genuinely new ones. Shocking: machines trained on past text are quite fond of the past.

Fifth, the outputs are reports, not deployed decisions. A modelling report is a step in decision-making. It is not the decision, the implementation plan, the legal review, the stakeholder negotiation, or the post-deployment monitoring system. Treating report generation as decision automation would be an impressively efficient way to misunderstand the paper.

The architecture lesson is governance disguised as agent design

The most durable insight from ModelingAgent is that practical AI systems need structured responsibility. The paper is nominally about mathematical modelling, but its deeper lesson applies to agentic systems more broadly.

A useful agent workflow needs:

  • role separation, so one model is not pretending to be researcher, modeller, engineer, writer, and reviewer at once;
  • shared memory, so intermediate decisions and artefacts are inspectable;
  • tool access, so outputs can be grounded in external data and computation;
  • critic loops, so the system can revise before final delivery;
  • evaluation rubrics, so quality is not left to vibes in a blazer;
  • human oversight, so persuasive mistakes do not become operational policy.

This is why ModelingAgent is more interesting than yet another “LLM beats baseline” result. It sketches a pattern for turning generative models into structured analytical systems. Not perfect systems. Not autonomous experts. Systems with a workflow, review points, and failure surfaces that can at least be named.

For Cognaptus-style enterprise AI, that is the useful direction: not a chatbot that answers strategy questions, but an analytical pipeline that helps teams formulate, investigate, model, critique, and document decisions.

The model should not be treated as the expert. It should be treated as a junior modelling team with tireless energy, uneven judgement, decent tools, and a suspiciously confident writing style. Useful, yes. Dangerous when unsupervised, obviously. Potentially transformative when wrapped in process, governance, and domain expertise. There, the adults are back in the room.

Conclusion: divide the work, then trust the review

ModelingAgent shows that real-world problem solving is not improved by tool access alone. It improves when the system is organised around the actual structure of analytical work: framing, evidence gathering, modelling, reporting, and critique.

That is why the paper matters. ModelingBench gives researchers a more realistic target than tidy math benchmarks. ModelingAgent offers a concrete architecture for multi-agent modelling workflows. ModelingJudge provides a scalable, if imperfect, way to evaluate open-ended reports. Together, they move the conversation from “Can LLMs solve math?” to “Can LLM systems support disciplined modelling under ambiguity?”

The answer is: increasingly, yes—but not without structure, not without review, and not without boundaries.

Which is less thrilling than “AI solves real-world problems,” but much closer to being useful.

Cognaptus: Automate the Present, Incubate the Future.


  1. Cheng Qian, Hongyi Du, Hongru Wang, Xiusi Chen, Yuji Zhang, Avirup Sil, Chengxiang Zhai, Kathleen McKeown, and Heng Ji, “ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges,” arXiv:2505.15068, 2025, https://arxiv.org/abs/2505.15068