Cloudy With a Chance of Local Models: When On-Prem AI Starts Beating the API
Server room. That phrase used to sound like a warning label in enterprise AI strategy. If a company wanted serious model capability, the usual advice was simple: use a cloud API, negotiate procurement terms, and pretend the legal team was not reading the data-processing agreement with growing despair.
Local AI was framed as the compromise option: useful for privacy, acceptable for demos, and slightly embarrassing when compared with the fashionable API leaderboard. The familiar story was that cloud models had the brains, while local models had the air gap.
A new benchmark on System Dynamics AI assistants makes that story look too lazy.[1] Not entirely wrong. Just too lazy, which is often worse because it survives longer in slide decks.
The paper evaluates cloud and local large language models on two purpose-built benchmarks: a 53-test causal loop diagram extraction benchmark, and a discussion benchmark covering model-building guidance, feedback explanation, and error-fixing suggestions. The main result is not that local models have “beaten the cloud.” That would be the cheap headline. The better reading is more operational: the cloud/local gap is now task-shaped.
For some structured System Dynamics tasks, local models are not merely acceptable. They match or exceed cloud models. For other tasks, especially iterative model editing and long-context error diagnosis, cloud models or carefully chosen local fallbacks still matter. The useful question is no longer “cloud or local?” It is “which task, which model, which backend, and how much context?”
That is a less viral question. It is also the one that determines whether an AI deployment works.
The benchmark tests a workflow, not just a model personality
System Dynamics uses causal loop diagrams, or CLDs, to represent variables, causal links, link polarities, and feedback loops. A useful AI assistant in this domain cannot merely produce a fluent explanation of “complex systems.” It must convert messy text into structured causal graphs, preserve existing relationships when updating a model, and explain why a feedback structure behaves the way it does.
The paper therefore separates the work into categories. This is important because the categories do not fail in the same way.
| Benchmark area | What the model must do | Why it is operationally different |
|---|---|---|
| Conformance | Produce schema-valid CLD output with correct variables, links, polarities, and constraints | Tests discipline under structure, not just comprehension |
| Translation | Convert natural-language system descriptions into CLDs | Tests causal extraction from prose |
| Iterative model building | Update an existing CLD while preserving prior relationships | Tests editing memory and preservation, not just extraction |
| Causal reasoning | Trace indirect effects through feedback structures | Tests loop-aware reasoning; only three tests, so results are indicative |
| Discussion tasks | Coach model building, explain feedback, and suggest error fixes | Tests conversational System Dynamics assistance under varying context length |
That separation is the paper’s real editorial gift. A single leaderboard would invite the usual ranking reflex: model A beats model B, therefore buy model A. The benchmark makes that reflex look crude. A model can be strong at extracting a new CLD and weak at updating an old one. A backend can be reliable for JSON and dangerous for long-context dense models. A tiny model can outperform giant ones on a narrow editing task. Apparently, “bigger model good” is not a deployment strategy. Who knew.
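To make the object under test concrete, here is a minimal Python sketch of a CLD as a list of signed links, together with the standard System Dynamics rule for loop polarity: an even number of negative links gives a reinforcing loop, an odd number a balancing one. The `Link` class, the `loop_type` function, and the population example are illustrative assumptions, not the paper's schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    polarity: str  # "+" or "-"


def loop_type(loop: List[str], links: List[Link]) -> str:
    """Classify a closed loop given as an ordered list of variable names."""
    polarity_of: Dict[Tuple[str, str], str] = {
        (link.source, link.target): link.polarity for link in links
    }
    negatives = 0
    for i, source in enumerate(loop):
        target = loop[(i + 1) % len(loop)]  # wrap around to close the loop
        if (source, target) not in polarity_of:
            raise ValueError(f"missing link {source} -> {target}")
        if polarity_of[(source, target)] == "-":
            negatives += 1
    # Even count of negative links: reinforcing. Odd count: balancing.
    return "reinforcing" if negatives % 2 == 0 else "balancing"


# Hypothetical example: a textbook population structure.
links = [
    Link("births", "population", "+"),
    Link("population", "births", "+"),
    Link("population", "deaths", "+"),
    Link("deaths", "population", "-"),
]
births_loop = loop_type(["births", "population"], links)  # "reinforcing"
deaths_loop = loop_type(["population", "deaths"], links)  # "balancing"
```

Every benchmark category in the table above is some operation on a structure like this: emitting it validly, extracting it from prose, editing it without damage, or reasoning along its loops.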
How to read the evidence before turning it into business advice
The paper contains several kinds of evidence, and they should not be treated equally.
| Evidence type in the paper | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| CLD leaderboard, 53 tests | Main evidence | Relative performance on structured CLD extraction categories | General superiority across all enterprise tasks |
| Discussion leaderboard | Main evidence for assistant behavior | How local models handle coaching, feedback explanation, and error fixing | Full production readiness for long-context assistants |
| Backend comparisons: llama.cpp vs. mlx_lm | Implementation and deployment evidence | Backend choice changes JSON reliability, latency, and failure modes | That one backend is universally better |
| Parameter sensitivity tests | Robustness and sensitivity evidence | Some models are sensitive to temperature, top-p, top-k, and prompt engine | A universal decoding recipe for all LLMs |
| Failed-model analysis | Implementation failure diagnosis | Some poor results reflect tooling, template, or prompt-length failures | True model capability ceilings |
| Energy and edge-appliance appendices | Scenario analysis and exploratory extension | How performance parity could affect infrastructure choices | Direct metered proof of fleet-level energy economics |
This distinction matters because the article’s business interpretation should not overclaim. The CLD benchmark directly supports task-level deployment conclusions in System Dynamics. The appendices suggest infrastructure implications, but they are conditional scenario analyses. The task-routing result is a post hoc upper bound, not a fully deployed router. A serious reader should keep those boundaries intact. A less serious reader may turn them into a keynote slide by Tuesday.
Local models are already strong at translation and conformance
The headline CLD result is straightforward. The best cloud model, Gemini 2.5 Flash, scores 89% on the 53-test CLD leaderboard. The best single local model, Kimi K2.5 GGUF Q3 served through llama.cpp, scores 77%. That is not top-cloud performance, but it is mid-tier cloud performance: o4-mini is at 75% and Gemini 2.5 Pro is at 79%, so the best local model sits between them.
The more interesting result appears inside the categories.
On translation, Kimi K2.5 reaches 23/24. That is almost the cloud ceiling: Gemini 2.5 Flash reaches 24/24, while several local models reach 22–23/24. The paper notes that the universally difficult case involves multi-hop causation through an implicit intermediate variable. That is a meaningful boundary. Local models are strong when the causal relationships are stated in the text, but the hardest implicit-inference case remains fragile across the board.
On conformance, local models are even more provocative. Kimi K2.5 reaches 16/18, while DeepSeek V3.2 MLX-4 reaches 17/18. Cloud models cluster around 11–16/18. In the head-to-head comparison, DeepSeek V3.2 MLX-4 achieves 94% on conformance, above the best cloud score of 83%.
This is where the old privacy-compromise framing breaks. For structured internal automation, the local model is not merely “good enough if legal says no cloud.” In this benchmark, the best local setup is sometimes better.
For business users, conformance is not a decorative category. It is the difference between an AI assistant that produces usable structured output and one that writes a persuasive paragraph about why it has not followed the schema. If a workflow depends on strict JSON, diagram edges, taxonomy fields, audit labels, or workflow states, conformance is not clerical. It is the product.
The paper also identifies a useful failure pattern: maximum-cardinality constraints remain hard. Local models hallucinate extra links 49% of the time on those tests, compared with 33% for cloud models. Minimum-cardinality tests are easier for both. That pattern is familiar beyond System Dynamics: LLMs are often biased toward being comprehensive. They would rather add a plausible relationship than obey a strict upper bound. Helpful, in the way an enthusiastic intern is helpful before anyone gives them a checklist.
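A conformance check of this kind is cheap to run on every model reply. The sketch below is one way to do it with the Python standard library only. The field names (`links`, `from`, `to`, `polarity`) and the `check_conformance` helper are assumptions made for illustration; the paper's actual schema is not reproduced here.

```python
import json
from typing import List, Optional

REQUIRED_FIELDS = ("from", "to", "polarity")  # hypothetical field names


def check_conformance(raw: str, max_links: Optional[int] = None,
                      min_links: Optional[int] = None) -> List[str]:
    """Return a list of violations; an empty list means the reply conforms."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"bad JSON: {exc.msg}"]
    links = data.get("links") if isinstance(data, dict) else None
    if not isinstance(links, list):
        return ["missing 'links' array"]

    problems: List[str] = []
    for i, link in enumerate(links):
        if not isinstance(link, dict):
            problems.append(f"link {i}: not an object")
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in link]
        if missing:
            problems.append(f"link {i}: missing {missing}")
        elif link["polarity"] not in ("+", "-"):
            problems.append(f"link {i}: invalid polarity {link['polarity']!r}")

    # Cardinality constraints: the upper bound is the one models tend to break.
    if max_links is not None and len(links) > max_links:
        problems.append(f"too many links: {len(links)} > {max_links}")
    if min_links is not None and len(links) < min_links:
        problems.append(f"too few links: {len(links)} < {min_links}")
    return problems


# Hypothetical reply that is valid JSON but adds one link too many.
reply = json.dumps({
    "links": [
        {"from": "price", "to": "demand", "polarity": "-"},
        {"from": "demand", "to": "revenue", "polarity": "+"},
        {"from": "revenue", "to": "price", "polarity": "+"},
    ]
})
violations = check_conformance(reply, max_links=2)
# violations == ["too many links: 3 > 2"]
```

The point of returning a list rather than a boolean is that "well-formed but over the limit" and "not JSON at all" are different failures, and the paper's results suggest they have different causes.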
Iteration is where the local story becomes less comfortable
The sharpest cloud/local divide is not translation. It is iterative model building.
Here, the task is not to create a fresh CLD from text. The model receives an existing CLD plus a passage describing additions, and it must return an updated CLD while preserving the previous relationships. That preservation requirement changes the cognitive burden. The model must distinguish “what already exists” from “what should be added,” while resisting the temptation to rewrite the whole structure.
Cloud models dominate this category. Gemini 3.1 Pro and GPT-5.1 score 8/8. Gemini 2.5 Flash scores 7/8. Claude Opus 4.5 scores 6/8.
Most local models struggle badly. Kimi K2.5 scores only 1/8 in its best CLD configuration. Qwen 3.5 and Llama 4 Maverick score 0/8. DeepSeek V3.2 Q4_K_M reaches 4/8, which is useful but still below cloud leaders.
Then comes the odd result: GLM-5 MLX-4, a 9B model, scores 6/8. It matches Claude Opus 4.5 on this category and outperforms much larger local models.
That result should not be misread as “small models are better.” It says something more specific: iterative structured editing is a distinct capability. It may depend on training examples, architectural memory behavior, or a model’s ability to separate prior context from new instructions. The paper does not prove the mechanism, and it says so. But it does show that parameter count is a poor proxy for this category.
The observed failure mode is also useful. Local models tend to drop pre-existing relationships while hallucinating new feedback edges that “close loops” implied by the new text. In other words, they behave as if the task were fresh extraction rather than controlled revision. This matters for business workflows because many enterprise AI tasks are editing tasks disguised as generation tasks: updating a compliance map, revising a process diagram, modifying a risk register, extending a database schema. If the model cannot preserve what already exists, it is not an assistant. It is a very expensive reset button.
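The two halves of that failure mode, dropped relationships and invented ones, can be separated with plain set arithmetic. The following is a minimal sketch, assuming each link is represented as a `(source, target, polarity)` tuple; the `audit_revision` name and the example links are hypothetical and not taken from the benchmark.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str, str]  # (source, target, polarity)


def audit_revision(existing: Set[Edge], expected_new: Set[Edge],
                   returned: Set[Edge]) -> Dict[str, Set[Edge]]:
    """Compare a revised CLD against what the edit was supposed to produce."""
    return {
        # Links that were in the original model but vanished in the revision.
        "dropped": existing - returned,
        # Links the new passage called for that the model failed to add.
        "missing_additions": expected_new - returned,
        # Links that are neither pre-existing nor requested.
        "unrequested": returned - existing - expected_new,
    }


# Hypothetical revision: one old link is lost and one loop-closing link appears.
existing = {("A", "B", "+"), ("B", "C", "+")}
expected_new = {("C", "D", "-")}
returned = {("A", "B", "+"), ("C", "D", "-"), ("D", "A", "+")}

report = audit_revision(existing, expected_new, returned)
# report["dropped"] == {("B", "C", "+")}
# report["unrequested"] == {("D", "A", "+")}
```

Because polarity is part of the tuple, a silently flipped sign shows up twice: the original link appears under `dropped` and the flipped one under `unrequested`. The same check applies to any editing workflow where the prior state is known.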
Discussion tasks expose the context problem
The discussion benchmark evaluates the assistant as a System Dynamics mentor or critic. It covers model-building steps, feedback explanation, and error-fixing suggestions.
The results are mixed in a useful way.
For model-building steps, Kimi K2.5 GGUF Q3, DeepSeek V3.2 Q4_K_M, and Kimi K2.5 MLX-3bit all reach 100% in their evaluated groups. DeepSeek is faster than Kimi in this category, likely because it answers more directly without extended reasoning chains.
For feedback explanation, Kimi K2.5 MLX-3bit reaches 75% on the simple groups, outperforming Kimi GGUF Q3 at 67%. That result is interesting because the same MLX backend has major limitations elsewhere. For shorter explanatory tasks, it can be useful.
For error fixing, only Kimi K2.5 GGUF Q3 and DeepSeek V3.2 Q4_K_M achieve non-zero scores, both at 50%. The category uses very long prompts—80,000 to 146,000 tokens—because the model must inspect full model context and diagnose formulation errors. That is where local deployment stops being a tidy performance comparison and becomes a hardware/backend problem.
The paper documents a hard failure mode for mlx_lm on these long prompts: Metal GPU command buffer out-of-memory errors, even on a Mac Studio with 512 GB unified memory. The issue is not simply system RAM. It is how the backend handles attention and GPU command memory. llama.cpp avoids this specific failure mode because it manages the KV cache differently, but it has its own risks, including grammar-sampling hangs on long-context dense models.
The business lesson is not “local cannot do long context.” It is more precise: local long-context reliability depends heavily on backend architecture and model type. For error fixing, especially when prompts are above 80k tokens, cloud or a carefully engineered GGUF route remains safer. For shorter coaching and explanation tasks, local models are already useful.
Backend choice is not plumbing; it is model behavior in production
One of the paper’s strongest practical findings is that backend choice can matter more than quantization level.
That sentence should make infrastructure teams happy and product teams nervous. Happy, because there is leverage in engineering. Nervous, because it means a model evaluation that ignores serving backend may be measuring the wrong thing.
The paper compares GGUF/llama.cpp and MLX/mlx_lm configurations across matched model families. The differences are not cosmetic.
| Deployment factor | Paper finding | Business interpretation |
|---|---|---|
| llama.cpp JSON schema enforcement | Supports grammar-constrained JSON schema sampling | Strong for structured output, especially when schema validity is non-negotiable |
| mlx_lm JSON handling | Silently ignores response_format; requires explicit JSON-only prompt instructions | Cloud-style harnesses can fail unless adapted for MLX |
| mlx_lm context handling | No simple context-size flag; long prompts can trigger Metal OOM | Avoid for very long-context tasks unless tested under production-length prompts |
| llama.cpp long-context dense models | Grammar sampling can hang on dense long-context models | Reliable JSON does not mean universally safe serving |
| Quantization | Q4/MLX-4+ appears to have limited accuracy impact at large scale for these tasks | Do not overfocus on bit-width when backend and task category dominate |
The JSON issue is especially important. llama.cpp can enforce a JSON schema at the token level. mlx_lm, in the tested version, silently ignores response_format fields. Without explicit JSON instructions in the prompt, the model returns free-text narrative, which the harness then records as a bad-JSON failure. After adding an explicit JSON-only instruction with exact field names, MLX models produce valid JSON reliably.
This is not a minor integration note. Many enterprise evaluation harnesses are designed around cloud APIs and assume that response_format works as advertised. Move the same harness to local inference and the failure may look like model incompetence when it is actually backend mismatch. The model is not “bad at JSON”; the server ignored the contract. A tiny implementation detail, naturally, because production systems enjoy hiding cliffs under friendly abstractions.
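A backend-agnostic harness can defend against this by putting the contract in the prompt itself and verifying the reply, instead of trusting a server-side format option. The sketch below shows the idea using only the standard library. It does not call mlx_lm or llama.cpp; `build_json_only_prompt`, `parse_reply`, and the field names are hypothetical helpers, and the exact instruction wording used in the paper is not known here.

```python
import json
from typing import Any, Dict, Sequence


def build_json_only_prompt(task: str, field_names: Sequence[str]) -> str:
    """Append an explicit JSON-only instruction that names the exact fields."""
    fields = ", ".join(f'"{name}"' for name in field_names)
    return (
        f"{task}\n\n"
        "Respond with a single JSON object and nothing else. "
        "Do not add commentary or markdown fences. "
        f"Use exactly these top-level fields: {fields}."
    )


def parse_reply(text: str, field_names: Sequence[str]) -> Dict[str, Any]:
    """Parse a reply and fail loudly if the contract was not honoured."""
    try:
        data = json.loads(text.strip())
    except json.JSONDecodeError as exc:
        # Free-text narrative lands here; the cause may be the server, not the model.
        raise ValueError(
            "reply is not JSON; check whether the backend enforced the format"
        ) from exc
    if not isinstance(data, dict):
        raise ValueError("reply is JSON but not an object")
    missing = [name for name in field_names if name not in data]
    if missing:
        raise ValueError(f"reply is missing fields: {missing}")
    return data


FIELDS = ("variables", "links")  # hypothetical field names
prompt = build_json_only_prompt("Extract the causal loop diagram.", FIELDS)
parsed = parse_reply('{"variables": [], "links": []}', FIELDS)
```

Logging the error message alongside the backend name makes it possible to tell, after the fact, whether a batch of failures came from one server configuration or from the model.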
The model-router is the real product architecture
The paper’s most important business implication is task routing.
A single local model does not beat the best cloud model overall on CLD extraction. But if each category is routed to the best-performing local model after observing results, the post hoc local composition reaches 91%, compared with 89% for the best single cloud model. The routed composition uses DeepSeek V3.2 MLX-4 for conformance, Kimi K2.5 for causal reasoning and translation, and GLM-5 for iteration.
The paper is careful: this is a post hoc upper bound, not a deployed end-to-end router. A real system would need a reliable classifier that decides whether a query is conformance, translation, iteration, or causal reasoning before sending it to the right model. The router itself would need to be tested prospectively.
Still, the result is valuable because it changes the design question. Instead of asking whether local models can replace cloud models wholesale, the paper asks whether enough local capability exists in the model pool to support a routed deployment. For this benchmark, the answer is yes, with caveats.
A practical enterprise architecture might look like this:
| Task type | Preferred route suggested by the paper | Why |
|---|---|---|
| CLD translation from text | Kimi K2.5 GGUF Q3, zero-shot, low temperature | 23/24 translation; competitive with cloud leaders |
| Schema conformance checks | DeepSeek V3.2 MLX-4 or Kimi K2.5 | Local conformance can match or exceed cloud results |
| Iterative CLD updates | GLM-5 locally, or cloud fallback for maximum reliability | GLM-5 is the local outlier at 6/8; cloud leaders reach 8/8 |
| Short model-building coaching | Kimi K2.5 or DeepSeek V3.2 | Both reach 100% on model-building steps |
| Feedback explanation | Kimi K2.5 variants, with context-length awareness | MLX-3bit performs well on simple feedback tasks |
| Long-context error fixing | llama.cpp route or cloud fallback | 80k–146k prompts expose local backend limits |
This is the opposite of the one-model fantasy. The serious product is not “use our model.” It is “route this work to the minimum sufficient model and backend, with a fallback when the task changes shape.”
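As a rough illustration, the table above can be written down as a lookup plus two guards. This is a sketch under several assumptions: the task type is supplied by an upstream classifier that does not exist yet, the `Route`, `ROUTES`, and `choose_route` names are invented here, the model-to-backend pairing follows the GGUF/llama.cpp and MLX/mlx_lm split described earlier, and the 80,000-token threshold is simply the lower end of the error-fixing prompt range. It is not the paper's router, because the paper does not ship one.

```python
from dataclasses import dataclass
from typing import Dict

LONG_CONTEXT_TOKENS = 80_000  # lower end of the error-fixing prompt range


@dataclass(frozen=True)
class Route:
    model: str
    backend: str
    cloud_fallback: bool = False


CLOUD = Route("cloud model", "api")

# Hypothetical routing table derived from the task table above.
ROUTES: Dict[str, Route] = {
    "translation": Route("Kimi K2.5 GGUF Q3", "llama.cpp"),
    "conformance": Route("DeepSeek V3.2 MLX-4", "mlx_lm"),
    "iteration": Route("GLM-5 MLX-4", "mlx_lm", cloud_fallback=True),
    "model_building": Route("DeepSeek V3.2 Q4_K_M", "llama.cpp"),
    "feedback_explanation": Route("Kimi K2.5 MLX-3bit", "mlx_lm"),
    "error_fixing": Route("Kimi K2.5 GGUF Q3", "llama.cpp", cloud_fallback=True),
}


def choose_route(task_type: str, prompt_tokens: int,
                 max_reliability: bool = False) -> Route:
    """Pick a model and backend for a task that has already been classified."""
    route = ROUTES.get(task_type)
    if route is None:
        return CLOUD  # unknown task shape: do not guess
    if max_reliability and route.cloud_fallback:
        return CLOUD
    if prompt_tokens >= LONG_CONTEXT_TOKENS and route.backend == "mlx_lm":
        # Assumption: keep very long prompts away from mlx_lm, which hit
        # Metal out-of-memory errors at this length in the paper.
        return CLOUD
    return route


short_edit = choose_route("iteration", prompt_tokens=3_000)
careful_edit = choose_route("iteration", prompt_tokens=3_000, max_reliability=True)
long_diagnosis = choose_route("error_fixing", prompt_tokens=120_000)
```

The table is the easy part. The classifier that produces `task_type` is where routing errors would enter, which is why the paper's 91% figure is an upper bound.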
For Cognaptus-style automation, that architecture is more interesting than a leaderboard. Business process automation is full of category shifts: extraction, validation, exception handling, revision, explanation, escalation. Treating all of these as the same “AI task” is how teams end up with impressive demos and brittle workflows.
What the paper directly shows, and what Cognaptus infers
The paper directly shows that local models can be competitive with cloud APIs on System Dynamics CLD extraction, especially translation and conformance. It also shows that iteration and long-context error fixing remain harder. It documents backend-specific implementation failures and provides a practitioner guide for running very large local models on Apple Silicon.
From that, we can infer a business pathway, but the inference should stay bounded.
| Layer | What is supported | Boundary |
|---|---|---|
| Direct paper result | Local models reach 77% single-model CLD performance; task-routed local upper bound reaches 91% | The routed result is post hoc and not an implemented production router |
| Business inference | Local-first structured extraction is credible for privacy-sensitive and domain-specific workflows | Results are strongest for System Dynamics-style structured causal tasks |
| Direct paper result | Backend behavior affects JSON reliability, context handling, and latency | Tested backends and versions may change over time |
| Business inference | Evaluation must include serving backend, prompt engine, and context length, not just model name | Requires organisation-specific benchmarking |
| Direct paper result | GLM-5, despite 9B parameters, is the strongest local iteration model | Mechanism is not established; could be training data, architecture, or evaluation fit |
| Business inference | Smaller specialist models may be valuable inside a router | Do not generalize “small beats large” outside the tested category |
| Direct paper result | Energy and edge-appliance appendices suggest local deployment can be attractive under utilisation assumptions | Scenario analysis, not direct metered fleet economics |
The practical message is not that every company should buy a Mac Studio rack and declare independence from cloud APIs. The message is that a serious AI deployment should be task-benchmarked under realistic infrastructure conditions.
That means testing the actual prompt length, actual schema requirements, actual backend, actual fallback behavior, and actual failure recovery. A model that wins a public leaderboard but fails JSON under your local server is not the best model for your business. It is an expensive misunderstanding.
The edge-appliance idea is plausible, but not yet automatic
The paper’s appendices extend the argument into edge AI appliances: local inference devices deployed at a university, agency, research group, or enterprise office. The benchmark hardware—a Mac Studio M3 Ultra with 512 GB unified memory—is treated as a practical example of such an appliance.
This matters for regulated or sensitive workflows. Healthcare system models, defense logistics, commercial strategy maps, and offline research environments are not always suitable for third-party cloud APIs. In those settings, a slightly weaker local model may be more useful than a stronger cloud model that governance will not allow.
But the appliance story should not be oversold. The paper’s hardware-tier analysis includes projections for entry and mid-tier devices, not direct benchmark measurements. The energy analysis uses scenario assumptions about power draw, utilisation, output token counts, and cloud infrastructure. The task-routing appliance remains a product concept requiring engineering work.
Still, the direction is significant. If a local device can handle structured extraction, conformance, and some coaching tasks at competitive performance, then “on-prem AI” stops meaning a miniature data centre. It starts meaning a domain-specific appliance with a measured task portfolio.
That is a different procurement conversation. Less glamorous than buying access to a famous API. More useful for organisations whose actual problem is not glamour.
The boundaries that matter
Several limitations affect how far the results should travel.
First, the benchmark is domain-specific. System Dynamics CLD extraction is structured, causal, and schema-heavy. That makes it relevant to process mapping, policy modelling, compliance mapping, and decision-support systems, but it does not automatically generalize to creative writing, legal reasoning, medical diagnosis, coding agents, or customer support.
Second, the runs use a single seed, and the causal reasoning category has only three tests. A single test represents 33 percentage points. The causal reasoning results are useful signals, not statistical granite.
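The same granularity applies to the headline numbers. The short calculation below takes the category sizes implied by the reported denominators (18, 24, 8, and 3 tests, which sum to 53) and asks which pass counts are consistent with the reported percentages. It assumes scores are pass/fail counts rounded to the nearest whole percent; that is an assumption about reporting, not something the paper is quoted as stating.

```python
from typing import Dict, List

# Category sizes taken from the denominators reported in the article.
CATEGORY_TESTS: Dict[str, int] = {
    "conformance": 18,
    "translation": 24,
    "iteration": 8,
    "causal_reasoning": 3,
}
TOTAL_TESTS = sum(CATEGORY_TESTS.values())  # 53


def points_per_test(n_tests: int) -> float:
    """Percentage points that one pass or fail moves a score."""
    return 100.0 / n_tests


def passes_matching(reported_percent: int, n_tests: int) -> List[int]:
    """Pass counts whose score rounds to the reported whole-number percentage."""
    return [k for k in range(n_tests + 1)
            if round(100 * k / n_tests) == reported_percent]


per_test = {name: round(points_per_test(n), 1)
            for name, n in CATEGORY_TESTS.items()}
# per_test["causal_reasoning"] == 33.3 and per_test["iteration"] == 12.5

cloud_passes = passes_matching(89, TOTAL_TESTS)   # [47]
routed_passes = passes_matching(91, TOTAL_TESTS)  # [48]
```

Under that assumption, the 91% routed composition and the 89% cloud leader differ by one test out of 53. With single-seed runs, that is a reason to read the routed result as "comparable", not as a win.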
Third, hardware matters. The largest local models were tested on a high-memory Apple Silicon workstation. Most offices do not casually have 512 GB unified-memory machines waiting under the ficus. Lower-tier hardware may support smaller models, but the paper’s entry and mid-tier claims are partly projected.
Fourth, the infrastructure failures are version-specific. mlx_lm, llama.cpp, LM Studio, model templates, and model families evolve quickly. A failure mode documented in this benchmark may be fixed later, and a new one may appear with perfect comedic timing.
Finally, the task-router result is not a deployed system. It is a measured opportunity. The next serious step would be a prospective router that classifies incoming requests, sends them to the chosen local or cloud path, and reports end-to-end performance including routing errors.
Conclusion: cloud superiority is now conditional
This paper is useful because it does not give us the easy answer.
It does not prove that cloud APIs are obsolete. They are not. Cloud models still lead on several difficult tasks, especially iterative editing and long-context diagnosis. They also offer serving maturity, operational elasticity, and lower setup burden.
It also does not let local models stay in the privacy-only corner. On structured CLD translation and conformance, local models are already competitive and sometimes superior. The best local single model reaches mid-tier cloud performance. The best category-routed local composition suggests that the remaining gap is not simply model quality. It is architecture.
The next enterprise AI stack will probably not be cloud-only or local-only. It will be routed. Local models will handle structured extraction, schema conformance, and sensitive internal workflows. Specialist models will handle narrow editing or coaching categories. Cloud models will remain fallbacks for long-context, high-uncertainty, or maximum-reliability tasks.
That is less dramatic than “the API is dead.” It is also more disruptive. Once cloud superiority becomes conditional, procurement changes. Benchmarking changes. Product architecture changes.
And somewhere, a one-model vendor pitch becomes just a little harder to say with a straight face.
Cognaptus: Automate the Present, Incubate the Future.
[1] Terry Leitch, “Benchmarking System Dynamics AI Assistants: Cloud Versus Local Large Language Models on CLD Extraction and Discussion,” arXiv:2604.18566v2, April 2026, https://arxiv.org/abs/2604.18566.