Simulate This: When LLMs Stop Talking and Start Modeling

A simulation model is not a chatbot with a spreadsheet attached.

That sounds obvious until a project team starts treating the LLM as if it were the entire modeling stack: the analyst, the programmer, the validator, the documentation clerk, the statistical package, and occasionally the intern blamed when the result changes on Tuesday. The convenient story is that better prompting will tame the system. Add more examples. Add RAG. Set temperature to zero. Smile at the demo.

Philippe J. Giabbanelli’s guide to LLMs in modeling and simulation is useful because it refuses that convenient story.¹ It is not another tour of “AI will transform simulation.” It is closer to an engineering memo from someone who has spent enough time with these systems to know where the floorboards creak.

The paper’s central point is simple but operationally expensive: using LLMs in modeling and simulation is not a prompt-crafting problem. It is workflow design. Prompts matter, but so do decoding settings, input representation, retrieval design, parametric adaptation, repeated-run variability, model routing, tool integration, and the unglamorous habit of documenting what was actually done.

That is the difference between an LLM that produces a plausible model and a modeling workflow that can be tested, repeated, criticized, and improved. One is a demo. The other has a chance of becoming infrastructure.

The mistake is treating the LLM as the model, not as a component

The paper is written for modeling and simulation practitioners, not for ML specialists. That matters. In many organizations, the people commissioning simulations are not trying to publish a decoding paper. They are trying to answer business questions: where to place warehouses, how a policy shock may propagate, how agents in a market may respond, how a safety rule behaves under edge cases, or how a process bottleneck changes under different demand scenarios.

LLMs enter this world because they are good at translation. They can turn messy text into structured concepts, summarize model behavior, generate code, help explain simulation outputs, or mediate between human requirements and specialized tools. The temptation is to stretch that usefulness into full delegation.

The paper repeatedly pushes against that temptation. An LLM can help build or interpret a model. It does not automatically become a reliable modeling method simply because it returned something that looked reasonable.

The practical reading is this:

Category	Tempting but weak practice	Better operating principle
Prompting	Keep adding instructions until the answer looks right	Decompose the task and validate each stage
Hyperparameters	Leave defaults untouched or set temperature to zero	Report and optimize decoding choices by task
Knowledge augmentation	Add RAG because “more context helps”	Decide whether the workflow needs scenario facts, stable conventions, or both
Evaluation	Report average performance from a few runs	Measure variability, worst cases, and repeated-run agreement
Tool use	Ask the LLM to perform the whole task	Use the LLM to translate between humans and specialized tools
Architecture	Chain whatever works in a notebook	Document the workflow, model versions, retrieval design, and control points

This is why the paper is better read as a decision map than as a catalog of techniques. The techniques are familiar: prompts, RAG (retrieval-augmented generation), LoRA (low-rank adaptation), adapters, decoding hyperparameters, factorial designs, model checkers. The harder part is deciding which technique belongs to which layer of the system.

Category 1: Prompts are task interfaces, not magic spells

The paper’s discussion of prompting is deliberately practical. It does not pretend prompt engineering is dead, nor does it inflate it into a mystical profession. A prompt is the interface between a task and a model. That interface can be vague, overloaded, brittle, or well structured.

For modeling and simulation, this distinction is not cosmetic. Suppose the goal is to extract a conceptual model from a corpus. A weak prompt asks the LLM to “extract the model.” A stronger workflow decomposes the job: identify concepts, identify relationships, classify relationship types, validate each step, and output the result in a format that downstream code can parse.

That last phrase is important: downstream code. If the output has to feed a simulation pipeline, the LLM’s prose is not the deliverable. The deliverable is a structured artifact that can be checked and used.
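A minimal sketch of what "a structured artifact that can be checked" might look like in practice. The JSON layout used here (the `concepts` and `relationships` keys, the `+`/`-` polarity labels) is an assumption made for illustration, not a format taken from the paper.

```python
import json

ALLOWED_POLARITIES = {"+", "-"}  # assumed label set, for illustration only


def validate_conceptual_model(raw: str) -> list[str]:
    """Return the problems found in an LLM reply; an empty list means usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]

    concepts = data.get("concepts")
    if not isinstance(concepts, list) or not all(isinstance(c, str) for c in concepts):
        return ["'concepts' must be a list of strings"]
    relationships = data.get("relationships")
    if not isinstance(relationships, list):
        return ["'relationships' must be a list"]

    known = set(concepts)
    problems = []
    for i, rel in enumerate(relationships):
        if not isinstance(rel, dict):
            problems.append(f"relationship {i}: must be an object")
            continue
        # Every relationship must connect two concepts that were actually extracted.
        for end in ("source", "target"):
            name = rel.get(end)
            if not isinstance(name, str) or name not in known:
                problems.append(f"relationship {i}: unknown {end} {name!r}")
        polarity = rel.get("polarity")
        if not isinstance(polarity, str) or polarity not in ALLOWED_POLARITIES:
            problems.append(f"relationship {i}: invalid polarity {polarity!r}")
    return problems


# Hypothetical replies: one usable, one chatty, one with a dangling concept.
good_reply = '{"concepts": ["stress", "sleep"], "relationships": [{"source": "stress", "target": "sleep", "polarity": "-"}]}'
chatty_reply = "Sure! Here is the model you asked for."
dangling_reply = '{"concepts": ["stress"], "relationships": [{"source": "stress", "target": "sleep", "polarity": "-"}]}'
```

The point of the sketch is that the chatty reply and the dangling reference are both rejected before anything reaches the simulation pipeline.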

The paper gives several prompt-level concerns that business teams often underestimate.

First, longer prompts are not automatically better. More examples, more instructions, and more context can degrade performance. This is particularly relevant for business process automation, where teams often respond to failures by appending another paragraph to the prompt. Eventually the prompt becomes a landfill of corrections, exceptions, and emotional damage.

Second, representation matters. The same mathematical object can be described in multiple ways, but LLM performance may differ sharply across those formats. The paper cites work where graph connectivity performance differed depending on whether the graph was described as edges or as neighbor lists. For simulation practitioners, that means the encoding of the model is part of the experimental design, not an implementation afterthought.
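To make the representation point concrete, here is a small sketch that renders one undirected graph in the two textual forms mentioned above. The exact wording used in the cited work is not reproduced here; both formats and the example graph are illustrative assumptions.

```python
def as_edge_list(edges):
    """Describe the graph as a flat list of edges."""
    return ", ".join(f"({u}, {v})" for u, v in edges)


def as_neighbor_lists(edges):
    """Describe the same undirected graph as one line of neighbors per node."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return "\n".join(
        f"{node}: {', '.join(sorted(neighbors[node]))}" for node in sorted(neighbors)
    )


# Hypothetical graph with two connected components.
edges = [("A", "B"), ("B", "C"), ("D", "E")]
edge_text = as_edge_list(edges)
neighbor_text = as_neighbor_lists(edges)
```

Both strings carry identical information. If a workflow's accuracy changes between them, the encoding belongs in the experimental record alongside the prompt.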

Third, output extraction is not a small annoyance. Even when a prompt asks for one label, the model may return a sentence. Sometimes regular expressions are enough. Sometimes structured output modes help. Sometimes a second, simpler model is used for extraction. But each choice adds cost, variability, and a new failure point.
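As a sketch of the regular-expression option, the function below pulls a single label out of a sentence-shaped reply and refuses to guess when the reply is ambiguous. The label set is hypothetical.

```python
import re

LABELS = ("positive", "negative", "neutral")  # hypothetical label set
_LABEL_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(label) for label in LABELS) + r")\b",
    flags=re.IGNORECASE,
)


def extract_label(response: str):
    """Return the one label mentioned in the response, or None if absent or ambiguous."""
    found = {match.lower() for match in _LABEL_PATTERN.findall(response)}
    return found.pop() if len(found) == 1 else None
```

Returning `None` instead of the first match is a deliberate choice in this sketch: an ambiguous reply becomes a visible failure that the workflow has to handle, not a silent coin flip.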

The business implication is not “hire prompt engineers.” It is more precise: treat prompts as modular task specifications. Each prompt should have a role, expected input, expected output, validation rule, and failure-handling path. That looks less romantic than prompt wizardry, which is usually a good sign.
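One way to read "modular task specification" is as a small data structure. In the sketch below, `PromptSpec`, `run_step`, and the stand-in `flaky_llm` are all hypothetical names; `llm` can be any callable that takes a prompt string and returns a string.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PromptSpec:
    role: str                        # what this step is for
    template: str                    # expected input, as named placeholders
    expected_output: str             # human-readable description of the output
    validate: Callable[[str], bool]  # validation rule
    max_attempts: int = 3            # failure handling: bounded retries, then raise


def run_step(spec: PromptSpec, llm: Callable[[str], str], **inputs) -> str:
    """Run one step; return the first output that passes validation."""
    prompt = spec.template.format(**inputs)
    for _ in range(spec.max_attempts):
        output = llm(prompt).strip()
        if spec.validate(output):
            return output
    raise ValueError(f"{spec.role}: no valid output after {spec.max_attempts} attempts")


spec = PromptSpec(
    role="classify-relationship",
    template="Classify the relationship in: {text}\nAnswer with one word: causal or correlational.",
    expected_output="exactly one of: causal, correlational",
    validate=lambda s: s in {"causal", "correlational"},
)

# Stand-in for a real model call: answers with a sentence first, then a label.
_replies = iter(["Well, I think it is causal.", "causal"])


def flaky_llm(prompt: str) -> str:
    return next(_replies)


result = run_step(spec, flaky_llm, text="Stress reduces sleep quality.")
```

Raising after a bounded number of attempts is only one possible failure path; routing to human review would fit the same structure.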

Category 2: Decoding settings are part of the method

Temperature is often treated like a personality knob: lower means serious, higher means creative. In modeling and simulation, that framing is too shallow.

The paper explains that decoding hyperparameters shape how probability distributions become generated text. These are not the training hyperparameters used to build the model. They are inference-time controls. They affect what the model does today, inside the workflow.
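A small numerical sketch of the mechanism: temperature divides the model's raw scores (logits) before they are turned into probabilities with a softmax. The logits below are invented for illustration, and real systems combine temperature with other controls such as top-p or top-k that are not shown here.

```python
import math


def token_probabilities(logits, temperature):
    """Softmax over logits scaled by 1/temperature."""
    if temperature <= 0:
        # Many systems treat temperature 0 as greedy decoding (pick the top token);
        # the softmax formula itself is undefined there.
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


logits = [2.0, 1.0, 0.0]  # invented scores for three candidate tokens
sharp = token_probabilities(logits, 0.5)
default = token_probabilities(logits, 1.0)
flat = token_probabilities(logits, 2.0)
```

Lower temperatures concentrate probability on the top-scoring token and higher temperatures spread it out, which is why the setting changes what the workflow produces even when the prompt and model are fixed.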

This creates two immediate problems.

The first is reproducibility. Many studies leave temperature at the provider default or do not report it. But defaults are not universal. One provider’s default may be another provider’s experiment. A paper, report, or internal validation document that says “we used GPT-like model X” without reporting decoding settings is not describing a method. It is describing a vibe.

The second problem is optimization. The paper reports cases where optimal temperature depended on the model, task, and other parameters. In one readability task, GPT-3.5 Turbo performed best at different temperatures depending on the context-window size. In another modeling task, some models barely reacted to temperature changes, while others could show much larger error changes under the wrong setting. The lesson is not that one temperature is best. The lesson is that temperature interacts with the task.

That matters for businesses building AI-assisted simulation or decision-support products. A team may optimize a prompt, evaluate it once, and then deploy it as if the behavior belongs to the prompt alone. But the behavior may also belong to the decoding regime, provider implementation, routing layer, cache behavior, and model version.

For internal governance, the minimum record should include:

Control item	Why it matters
Model provider and model ID	Names can be aliases; models can be updated or silently forwarded
Decoding parameters	Defaults vary and affect reproducibility
Prompt version	Prompt changes are method changes
Retrieval configuration	RAG changes what evidence enters the model
Structured output mode or parser	Extraction rules affect downstream data
Cache behavior	Caching can create an illusion of determinism
Repeated-run protocol	Single runs hide variability
Cost and latency mode	Cheap routing may change provider behavior

None of this is glamorous. Governance rarely is. But if a simulation output influences resource allocation, risk assessment, or operational policy, “we used the default settings” is not a methodological statement. It is a confession.
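The control items listed above can be sketched as a record that is saved next to every result. The field names and all example values are hypothetical; hashing the prompt text is one assumed way to pin down a prompt version.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class RunRecord:
    provider: str
    model_id: str
    decoding: dict       # e.g. temperature, top_p, max tokens
    prompt_text: str
    retrieval: dict      # e.g. corpus version, chunk size, top-k
    output_parser: str   # structured output mode or parser name
    cache_enabled: bool
    repetitions: int     # size of the repeated-run protocol
    service_tier: str    # cost and latency mode

    def to_json(self) -> str:
        record = asdict(self)
        # Store a fingerprint instead of the full prompt so any change is detectable.
        prompt = record.pop("prompt_text")
        record["prompt_sha256"] = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        return json.dumps(record, sort_keys=True, indent=2)


# All values below are placeholders, not recommendations.
example_record = RunRecord(
    provider="example-provider",
    model_id="example-model-2025-01-01",
    decoding={"temperature": 0.2, "top_p": 1.0},
    prompt_text="Classify the relationship in: {text}",
    retrieval={"corpus": "policies-v3", "top_k": 5},
    output_parser="json-schema",
    cache_enabled=False,
    repetitions=10,
    service_tier="standard",
)
record_json = example_record.to_json()
```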

Category 3: RAG supplies context, not salvation

RAG is often sold as the cure for hallucination: connect the model to documents, retrieve relevant passages, and let truth enter through the side door.

The paper is more careful. Retrieved context enters the model as tokens. It does not overwrite parametric knowledge. It coexists with it. That coexistence can help, but it can also confuse, bias, overload, or degrade performance.

For modeling and simulation, RAG has a legitimate role. It can ground a simulation in policy documents, mission reports, software manuals, formal semantic repositories, scientific literature, previous agent interactions, traffic scenarios, or simulation-generated artifacts. The paper’s examples show RAG being used not merely to “answer questions,” but to keep model construction and simulation behavior aligned with context.

But the paper also notes that RAG does not always help and has sometimes worsened performance. In one case involving NetLogo code generation, adding contextual knowledge did not improve the generated code. The possible reasons are familiar to anyone who has built a retrieval system: insufficient corpus coverage, noisy retrieval, poor chunking, weak integration, or an LLM leaning too strongly on contextual material even when it is irrelevant.

This gives a clean business distinction.

RAG is good for changing, inspectable, scenario-specific information: policy documents, client manuals, current procedures, local constraints, simulation outputs, and evidence that should be traceable.

RAG is not ideal for stable modeling conventions: how to express a causal loop diagram, how to structure an agent-based model critique, how to format scenario assumptions, or how to apply a repeated internal modeling style. For those, LoRA or adapters may be more appropriate, assuming the task is stable enough and the training data is good enough.

The paper’s RAG-versus-LoRA distinction is one of its most useful practical contributions:

Need	Better candidate	Reason
Scenario-specific facts	RAG	The knowledge changes and should be inspectable
Policy or client documents	RAG	Traceability matters
Stable modeling conventions	LoRA or adapter	The behavior should persist across prompts
Domain abstraction style	LoRA or adapter	The goal is not to retrieve facts but to enforce representation habits
Switching among known task regimes	Selection or routing methods	Different tasks may need different model pathways
Formal verification	Specialized tool, with LLM translation	The LLM should mediate, not replace the checker

For business automation, this distinction prevents a common architecture mistake: using RAG to fix every failure. Sometimes the system does not lack information. It lacks a stable operating convention. Stuffing more documents into the context window will not solve that. It will merely make the model wrong with a bibliography.

Category 4: Non-determinism is not removed by temperature zero

The paper’s most important operational warning is that non-determinism is broader than sampling. Setting temperature to zero can reduce one source of variation. It does not freeze the whole system.

Variation can come from sampling, quantization, attention optimizations, key-value cache management, provider routing, model updates, silent forwarding, account-level policy differences, geographic routing, and distributional properties of the model itself. A centralized API platform may route requests for the “same” model to different providers. Those providers may run different numerical formats or optimizations. The model name may remain stable while the implementation changes underneath.

This sounds technical because it is technical. It also has a very ordinary business consequence: the same workflow may not behave the same way tomorrow.

The paper argues that practitioners should evaluate non-determinism rather than merely complain about it. The goal is not always to eliminate variation. In low-risk, high-volume applications, cost and latency may matter more. But the decision should be informed.

Average accuracy is not enough. A system that averages 80% accuracy across runs may be stable on the same 80% of cases, or randomly correct on different cases each time. Those are not the same system. One has predictable blind spots. The other is a slot machine wearing a lab coat.

The paper points to measures such as worst-case accuracy, best-case accuracy, and total agreement rate across repeated runs. It also recommends using design-of-experiments logic to decompose how much variation comes from design choices, model parameters, retrieval choices, simulation parameters, and stochastic effects.
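A minimal sketch of these repeated-run measures, assuming "total agreement rate" means the share of items on which every run gives the same answer (one plausible reading, not necessarily the paper's exact definition). The two invented systems below both average 80% accuracy, yet behave very differently.

```python
def repeated_run_metrics(runs, truth):
    """runs: list of runs, each a list of predicted labels aligned with truth."""
    n = len(truth)
    accuracies = [sum(p == t for p, t in zip(run, truth)) / n for run in runs]
    # An item counts as agreed when every run produced the same label for it.
    agreed = sum(len({run[i] for run in runs}) == 1 for i in range(n))
    return {
        "worst_accuracy": min(accuracies),
        "best_accuracy": max(accuracies),
        "mean_accuracy": sum(accuracies) / len(accuracies),
        "total_agreement_rate": agreed / n,
    }


truth = ["a"] * 5

# Stable system: always wrong on the same item.
stable_runs = [["a", "a", "a", "a", "b"] for _ in range(5)]

# Unstable system: each run is wrong on a different item.
unstable_runs = [["b" if i == k else "a" for i in range(5)] for k in range(5)]

stable = repeated_run_metrics(stable_runs, truth)
unstable = repeated_run_metrics(unstable_runs, truth)
```

In this toy case the mean accuracy cannot tell the two systems apart, while the agreement rate separates them completely: 1.0 for the stable system and 0.0 for the unstable one.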

The paper illustrates the stakes with a practical case in conceptual model merging. A simple direct-equivalence method showed negligible randomness. A more sophisticated synonym-and-antonym derivation method showed large randomness, with most variability attributed to non-determinism: 53% with GPT and 95% with DeepSeek in the reported setting. Without repeated runs, a team might wrongly conclude that a prompt feature or counterexample design caused the performance difference. The real driver could be uncontrolled variability.
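To show what "attributing variability" can mean, here is a deliberately simplified stand-in: a one-way sum-of-squares decomposition that splits score variation into a part explained by the design condition and a part left to repeated runs. It is not the paper's factorial analysis, and the scores are invented.

```python
from statistics import mean


def variance_shares(scores_by_condition):
    """Split total variation into between-condition and within-condition shares."""
    all_scores = [s for runs in scores_by_condition.values() for s in runs]
    grand_mean = mean(all_scores)
    ss_total = sum((s - grand_mean) ** 2 for s in all_scores)
    if ss_total == 0:
        return {"design": 0.0, "repeated_runs": 0.0}
    # Variation of each run around its own condition mean: what the design cannot explain.
    ss_within = sum(
        (s - mean(runs)) ** 2 for runs in scores_by_condition.values() for s in runs
    )
    ss_between = ss_total - ss_within
    return {"design": ss_between / ss_total, "repeated_runs": ss_within / ss_total}


# Invented accuracy scores: two prompt variants, two repeated runs each.
scores = {"prompt_a": [0.6, 0.8], "prompt_b": [0.7, 0.9]}
shares = variance_shares(scores)
```

With these made-up numbers, 80% of the variation sits inside conditions. A team comparing only one run per prompt could easily credit the prompt for a difference that repeated runs would not support.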

For business use, this changes the evaluation question.

Do not ask only: “Did the LLM get the right answer?”

Ask:

Evaluation question	Business meaning
Does it give equivalent answers across repeated runs?	Can operations rely on stable behavior?
What is the worst-case performance?	Does it violate service-level or risk limits?
Are errors concentrated in identifiable cases?	Can humans review the right subset?
Does RAG reduce hallucination but increase variance?	Is traceability being bought with instability?
Does a cheaper provider route change results?	Are cost savings undermining reliability?
Does caching hide variability during testing?	Are demos overstating determinism?

This is where the paper becomes more than an academic guide. It gives a language for diagnosing the real failure mode: not “the model is random,” but “this workflow’s output variance is dominated by routing, representation, retrieval, or task formulation.” That is a much more useful sentence. Also less likely to appear in a LinkedIn carousel, which improves its credibility.

Category 5: “Working” is not the same as good science

The paper’s fourth section is a necessary slap on the wrist: LLM outputs can appear to work while degrading the quality of the scientific or modeling process.

The citation example is straightforward. An LLM can supply references. Some may even be real. But real references are not automatically relevant references, and relevant-looking references may not support the argument being made. If a research process becomes “ask the LLM for citations that support my claim,” the issue is not merely hallucination. It is confirmatory bias with better typography.

The same applies to data analysis and simulation code. An LLM may produce code that runs. It may even produce the correct result in a particular case. But generated code can still be low quality, insecure, fragile, or hard to maintain. In simulation, a generated implementation can “work” under one interpretation of the problem while failing under another equally plausible interpretation.

The paper discusses a case where an LLM-generated conjecture and LLM-generated simulation code interpreted the same text differently. The validation then failed, not because the model implementation was necessarily bad, but because the conjecture and implementation encoded different conceptual models. This is a subtle but severe failure: the validation pipeline tests interpretive consistency between separate LLM calls, not the adequacy of the simulation model itself.

For business teams, the lesson is direct. When LLMs are used in modeling workflows, every generated artifact should be tagged by role:

Artifact	What it should be judged against
Conceptual model	Domain interpretation and stakeholder assumptions
Simulation code	The approved conceptual model and technical tests
Conjecture or expected property	The same approved model interpretation
Narrative explanation	The simulation outputs and audience needs
Formal property	The requirements language of the verification tool
RAG evidence	Source relevance, traceability, and retrieval quality

The dangerous workflow is one where each artifact is generated independently from the same vague text description. That produces alignment theater. Everything looks related because the same words are floating around. The underlying assumptions may have quietly diverged.

Category 6: The best role for LLMs is often translation

One of the paper’s strongest recommendations is that LLMs should not replace specialized tools when those tools already exist. They should often translate into and out of them.

This is a pragmatic position. Model checkers, test generators, statistical packages, simulation engines, and formal verification tools exist for a reason. They are not always easy to use, but difficulty is not a sufficient reason to replace them with a probabilistic text generator.

A better architecture is mediation. The modeler states a requirement in natural language. The LLM translates that requirement into a formal representation. The specialized tool checks it. The LLM then translates the result back into a form the modeler can understand. If the formal tool returns parser errors or violated constraints, the LLM can use that feedback to refine the translation.
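The mediation loop described above can be sketched as follows. Both `translate` (standing in for an LLM call) and `check` (standing in for a specialized tool) are hypothetical callables, and the stubs use a made-up temporal-logic-like string purely to exercise the loop.

```python
def translate_and_check(requirement, translate, check, max_rounds=3):
    """Translate a requirement, let the tool check it, and feed errors back."""
    feedback = None
    history = []  # keeps every attempt so the exchange can be audited later
    for _ in range(max_rounds):
        formal = translate(requirement, feedback)
        accepted, message = check(formal)
        history.append((formal, accepted, message))
        if accepted:
            return formal, history
        feedback = message  # the tool's complaint guides the next translation
    raise RuntimeError(f"no accepted translation after {max_rounds} rounds")


# Stub "LLM": produces an unbalanced formula until it receives feedback.
def fake_translate(requirement, feedback):
    return "G(alarm -> F(exited))" if feedback else "G(alarm -> F(exited)"


# Stub "tool": accepts only formulas with balanced parentheses.
def fake_check(formal):
    balanced = formal.count("(") == formal.count(")")
    return balanced, "" if balanced else "parse error: unbalanced parentheses"


formal, history = translate_and_check(
    "When the alarm triggers, agents eventually exit.", fake_translate, fake_check
)
```

The authoritative verdict comes from `check`, never from the translator, and the bounded loop makes a persistent translation failure explicit instead of endless.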

The paper’s evacuation example captures the point. A modeler may say that when an alarm triggers, agents evacuate; once they exit, they do not re-enter; doors can become blocked. A formal tool may need assumptions that the modeler left implicit: the alarm stays on, people do not keep entering during evacuation, and certain movement abstractions apply. The LLM can surface those missing premises. It should not silently decide them.

This distinction is crucial for AI product design.

A poor product says: “Ask the AI whether your model behaves reasonably.”

A better product says: “The AI translated your requirement into these formal properties, inferred these missing assumptions, sent them to this verifier, and received this result.”

That is not just a better UX. It is a better epistemic contract. The human can inspect where the modeler’s words ended, where the LLM inferred, and where the formal tool established a result.
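A sketch of how that contract could be carried in data: every statement is tagged with where it came from, so LLM-inferred assumptions can be surfaced for sign-off. The class and function names are hypothetical; the example statements follow the evacuation scenario above.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    STATED = "stated by the modeler"
    INFERRED = "inferred by the LLM"
    VERIFIED = "established by the formal tool"


@dataclass(frozen=True)
class Claim:
    text: str
    origin: Origin


def needs_confirmation(claims):
    """Inferred assumptions should be shown to the modeler, not silently applied."""
    return [claim.text for claim in claims if claim.origin is Origin.INFERRED]


claims = [
    Claim("When the alarm triggers, agents evacuate", Origin.STATED),
    Claim("Agents that exit do not re-enter", Origin.STATED),
    Claim("The alarm stays on once triggered", Origin.INFERRED),
    Claim("No one enters during the evacuation", Origin.INFERRED),
]
to_review = needs_confirmation(claims)
```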

Category 7: Architecture is where productivity claims go to be tested

The paper reviews several ways LLMs can be integrated with modeling tools.

In a tool-embedded architecture, the LLM sits inside the modeling environment and provides suggestions while authoritative actions remain in the tool. This is conservative and often sensible.

Another architecture uses two LLMs as translators between user needs and tool requirements. That can avoid a shared schema, but it doubles model calls, increases cost and latency, compounds non-determinism, and risks semantic drift through natural language as the intermediate representation.

A third approach uses one LLM as the interface to one specialized tool. That may work for a narrow workflow, but it can create “interface bleeding”: users gradually need to know the underlying tool anyway to use the LLM effectively.

For multiple tools, the paper points toward a more scalable architecture: a shared LLM backbone with lightweight task-specific modules or adapters. Instead of loading separate fine-tuned models for every tool, a system can maintain a base model and compose reusable translation capabilities. This supports multi-tenant serving, reduces memory overhead, and makes it possible to document sequences of transformations.

For businesses, the architecture question should not begin with “Which model should we use?” It should begin with “Which decisions need to be stable, inspectable, and testable?”

A useful internal architecture map might look like this:

Layer	Main question	Governance requirement
User request	What is the business or modeling task?	Capture intent and scope
Task decomposition	Which sub-tasks should exist?	Version the workflow
Knowledge layer	What facts or conventions are needed?	Separate RAG from LoRA/adapters
Translation layer	Which formal tools or simulators are involved?	Record inferred assumptions
Execution layer	What tool actually performs the operation?	Prefer deterministic specialized tools where possible
Evaluation layer	How stable and correct are outputs?	Use repeated runs and worst-case metrics
Explanation layer	What should the user see?	Distinguish source facts, LLM inferences, and tool results

This is where ROI becomes real. The value is not “the LLM can do simulation.” The value is reducing friction between human intent, formal tools, domain evidence, and decision-facing explanations. That is a narrower claim. It is also far more useful.

What Cognaptus infers for business implementation

The paper is a guide and synthesis, not a universal benchmark. It does not prove that one architecture always wins. It gives principles and examples that business teams can turn into an implementation checklist.

Cognaptus would translate the paper into four practical rules.

First, classify the LLM’s role before building the workflow. Is it extracting concepts, generating code, translating requirements, selecting evidence, explaining results, or simulating agents? Each role has different failure modes. “Use AI” is not a role.

Second, separate facts from conventions. Use RAG for changing, inspectable evidence. Use LoRA or adapters only when the workflow needs persistent behavior for stable tasks. Do not use retrieval to solve a representation problem.

Third, evaluate variability directly. Run repeated tests. Track worst-case outcomes. Compare surface agreement (do the raw outputs match?) with decision-level agreement (do the parsed answers match?). Disable or account for caching when testing. Record provider, model ID, decoding settings, and routing assumptions. This is boring in the same way brakes are boring.

Fourth, delegate execution to specialized tools whenever possible. Let the LLM translate, orchestrate, and explain. Let formal tools check formal properties. Let simulation engines simulate. Let statistical packages calculate. The LLM is valuable because it can bridge representations, not because it should cosplay as every tool in the stack.

Boundaries: what the paper does not settle

The paper’s recommendations need to be applied with discipline.

It is not a new benchmark ranking LLMs for simulation. Its examples demonstrate mechanisms and risks, but they do not establish a universal performance ordering among GPT, Claude, Gemini, DeepSeek, or open-weight models. Those comparisons age quickly.

It also does not imply that every business workflow requires heavy experimental design. A low-risk internal assistant that summarizes simulation documentation may not need the same repeated-run protocol as an AI-mediated model validation system. The level of governance should match the consequence of failure.

RAG, LoRA, adapters, and tool-use architectures are moving targets. The paper itself notes that some techniques may be displaced or reshaped as models improve. That does not weaken the argument. It strengthens it. If the tools are unstable, the workflow discipline becomes more important, not less.

Finally, “temperature zero is not enough” should not be misread as “determinism is impossible, so give up.” The practical lesson is diagnostic: identify which source of variation matters, decide whether it matters for the use case, and mitigate it only when mitigation is worth the cost.

The paper’s real message: stop asking for cleverness, start asking for control

The useful shift in this paper is from clever prompting to controlled workflow design.

That shift is uncomfortable because it removes the fantasy that a sufficiently artful prompt can make an LLM behave like a dependable simulation system. It also removes the opposite fantasy that LLMs are too stochastic to be useful in serious modeling. Both views are lazy in different directions.

LLMs can be useful in modeling and simulation when their role is explicit, their inputs are structured, their knowledge sources are chosen deliberately, their variability is measured, and their outputs are connected to tools that can actually perform formal work. They become dangerous when “it worked once” is treated as evidence.

The business opportunity is therefore not to replace modeling expertise with AI. It is to make modeling expertise easier to express, test, reuse, and explain. LLMs can help with that. But only if they are placed inside an architecture that knows the difference between talking about a model and modeling.

The prompt is not the product. The workflow is.

Cognaptus: Automate the Present, Incubate the Future.

¹ Philippe J. Giabbanelli, “A Guide to Large Language Models in Modeling and Simulation: From Core Techniques to Critical Challenges,” arXiv:2602.05883, 2026. https://arxiv.org/abs/2602.05883
