TL;DR for operators

A model upgrade is not a software patch. It is closer to changing the interpreter under a production system while hoping every old script still means the same thing. Charming, in the way live wires are charming.

The paper behind this article, Prompt Migration: Stabilizing GenAI Applications with Evolving Large Language Models, studies that problem through Tursio, an enterprise search application that converts natural-language questions into structured operator trees for database querying.[1] Tursio’s old prompts were fully stable on GPT-4-32k. When the same prompts were run against GPT-4.1, tests passed at 98%. Against GPT-4.5-preview, they passed at 97.3%. That sounds minor until the application is generating SQL-like structures, where “almost correct” is not a governance model.

The paper’s useful contribution is not another pile of prompt-engineering advice. It reframes prompt migration as a lifecycle discipline. New model versions introduce behavioural drift; prompts that once relied on implicit inference may need to be rewritten into structured, explicit, testable interfaces. Tursio’s redesigned GPT-4.1 prompts restored test results to 100%, and the team built a 110-question migration testbed from production workloads to support future upgrades.

For operators, the message is direct: production prompts are not disposable text. They are application logic. They need versioning, migration plans, regression tests, ownership, and release gates. If your LLM application depends on “the model usually understands what we mean,” your next upgrade has already booked a meeting with you. It just has not sent the calendar invite yet.

The upgrade broke the contract nobody wrote down

The familiar business story is simple. A vendor releases a newer model. The benchmark slides look better. The context window is larger. Instruction-following has improved. Hallucination rates are supposedly lower. The procurement and product teams hear “upgrade.” Engineering hears “please enjoy your weekend incident.”

Tursio’s case is useful because it strips away the theatre. The application was not a toy chatbot trying to write cheerful refund emails. It was an enterprise search system that used LLMs to interpret natural-language questions into structured operator trees: select, project, filter, join, aggregate, and related reasoning operations. Those trees then fed into structured data systems such as SQL Server, Azure SQL, Snowflake, Databricks, BigQuery, Teradata, Cassandra, and others.

That matters because the LLM was not merely producing prose. It was helping assemble executable analytical intent. A missed filter, malformed column, or unnecessary ORDER BY is not a stylistic wobble. It changes the computation.

The system was originally built on GPT-4-32k in 2023. The paper reports that Tursio had more than 100 deployed instances, thousands of queries per day, and roughly 33 LLM calls per query on average. Each query carried about 6,000 input tokens and 1,000 output tokens, while training runs incurred more than 1,000 LLM calls. In other words, this was not “we tried a prompt in a notebook.” It was a model-dependent application stack with enough surface area for small behavioural shifts to become operational debt.

Then the lifecycle clock started ticking. GPT-4-32k was deprecated on June 6, 2024, with shutdown scheduled one year later. Tursio migrated to GPT-4.5-preview, which itself was deprecated on April 14, 2025, with a shutdown date only three months later. The current system runs on GPT-4.1. The timeline is the point: model migration was not an optional optimisation project. It was forced maintenance.

The old prompts had passed Tursio’s existing test suite on GPT-4-32k. On newer models, the same prompts did not behave the same way.

| Model | Old prompts: tests passed | Interpretation |
| --- | --- | --- |
| GPT-4-32k | 100% | Baseline stability on the model the prompts were tuned for |
| GPT-4.1 | 98% | Small numerical drop, but enough to break CI/CD expectations |
| GPT-4.5-preview | 97.3% | Larger drift, including output-format and parser-facing failures |

The obvious but wrong reading is: “Only two or three percent failed, so this is manageable.” That is spreadsheet comfort. In production workflows, failure distribution matters more than average pass rate. If the failed cases sit in high-value queries, compliance-sensitive workflows, executive dashboards, or automated decisions, the difference between 100% and 98% is not a rounding error. It is a release blocker wearing a small hat.
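One way to make that concrete is to gate on where failures land rather than on the aggregate. The sketch below is illustrative only: the tier names, the `CaseResult` record, and the per-tier thresholds are assumptions made for this example, not something the paper specifies.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    tier: str      # hypothetical business-criticality label
    passed: bool


# Assumed policy: minimum pass rate required per tier before an upgrade ships.
MIN_PASS_RATE = {"critical": 1.0, "standard": 0.98, "low": 0.95}


def pass_rates_by_tier(results):
    """Return {tier: pass_rate} computed from a list of CaseResult."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r.tier] += 1
        passes[r.tier] += int(r.passed)
    return {tier: passes[tier] / totals[tier] for tier in totals}


def blocking_tiers(results):
    """Return the sorted tiers whose pass rate is below the assumed threshold."""
    rates = pass_rates_by_tier(results)
    return sorted(t for t, rate in rates.items() if rate < MIN_PASS_RATE[t])


# 100 made-up results: 98 pass overall, but both failures sit in the critical tier.
results = (
    [CaseResult(f"c{i}", "critical", i >= 2) for i in range(10)]
    + [CaseResult(f"s{i}", "standard", True) for i in range(90)]
)
overall = sum(r.passed for r in results) / len(results)
print(f"overall pass rate: {overall:.0%}")
print("blocking tiers:", blocking_tiers(results))
```

With this made-up data the aggregate looks like the 98% in the table, yet the gate still blocks, because every failure falls in the tier that tolerates none.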

The old prompt relied on the model being generously implicit

The paper gives a simplified version of an old prompt for extracting SQL filters. Its spirit was compact: extract filters from input, use underscores for column names, do not quote column names, then rely on few-shot examples.

That kind of prompt works when the model fills in the missing operational contract. The model sees a fragment, infers intent, maps vague wording to schema-compatible operators, keeps the output parser happy, and generally behaves like a cooperative analyst who has read the internal manual even though nobody sent it.

GPT-4-32k apparently did enough of that for Tursio’s existing test suite. The newer models did not preserve the same implicit behaviour.

The failures fell into several categories:

| Failure type | What happened | Why it matters operationally |
| --- | --- | --- |
| Missing ordering or grouping | Fragments implying ORDER BY or GROUP BY were missed or mishandled | Trend and aggregation queries can silently change meaning |
| Incorrect column names | The model inferred columns that did not exist, or simplified valid names incorrectly | Downstream execution fails, or worse, uses the wrong field |
| Semantic misreading | A schema term such as TIMES_LATE could be mistaken for a time dimension | Surface-level language associations override domain semantics |
| Ambiguous or absent filters | Filters that should have been inferred were omitted | Queries become broader, less accurate, or misleading |
| Output-format drift | GPT-4.5-preview sometimes produced parser-hostile messages or formats | The application contract breaks before business logic can run |

The important mechanism is subtle. Newer models were not simply “worse.” GPT-4.1, according to the paper’s reading of model guidance, placed more weight on explicit instructions and less on unsupported inference. That can be desirable in general. Businesses do not usually want a model improvising with confidence. But if the old prompt depended on the model improvising in exactly the right way, improved instruction discipline becomes a breaking change.

This is the uncomfortable part. A model can become better by benchmark standards and still become worse for a particular application contract. Not because benchmarks are useless, but because your prompt encoded a private protocol the benchmark never measured.
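Some of the failure types in the table above, notably invented column names, can be caught mechanically before anything executes. The following is a minimal sketch under stated assumptions: the schema, the tuple shape for a filter, and the `check_columns` helper are all hypothetical, and only `TIMES_LATE` comes from the paper's example.

```python
def check_columns(filters, schema_columns):
    """Split extracted filter columns into known and unknown names.

    `filters` is a list of (column, operator, value) tuples as a parser
    might produce them; `schema_columns` is the set of real column names.
    """
    known, unknown = [], []
    for column, _op, _value in filters:
        (known if column in schema_columns else unknown).append(column)
    return known, unknown


# Hypothetical schema and model output, for illustration only.
schema = {"CUSTOMER_ID", "TIMES_LATE", "ORDER_DATE", "REGION"}
extracted = [
    ("TIMES_LATE", ">", "3"),   # exists in the schema
    ("LATE_TIME", ">", "3"),    # invented by the model: should be rejected
    ("REGION", "=", "EMEA"),
]

known, unknown = check_columns(extracted, schema)
print("known:", known)
print("unknown:", unknown)
```

A check like this only catches names that do not exist. It cannot tell whether a valid column was the right one, which is why the semantic misreadings in the table still need test cases with expected outputs.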

The failure was small in percentage and large in meaning

The Tursio pass-rate table is the paper’s main evidence, but it should not be read as a broad benchmark. It does one job: it demonstrates that a real production-style prompt suite can lose reliability when moved across model versions, even when the application and prompts remain unchanged.

It does not prove that GPT-4.1 or GPT-4.5-preview are generally less reliable than GPT-4-32k. It does not prove that all enterprise search systems will see the same failure rate. It does not even prove that every Tursio prompt category drifted equally. The paper does not provide that level of distributional detail.

What it does show is operationally enough: old prompt stability did not transfer automatically.

The more interesting evidence is the failure analysis. The errors cluster around implicit interpretation and output contract adherence. That tells us the migration problem is not just “rewrite the prompt to be longer.” It is “identify the invisible assumptions the older model had been satisfying, then make those assumptions explicit enough for the newer model.”

This is the difference between prompt editing and prompt porting.

Prompt editing asks: “Can we make the answer better?”

Prompt porting asks: “Can we preserve application behaviour under a changed model runtime?”

The second question is harder because it is not about elegance. It is about compatibility.

The fix was a structured interface, not a magic incantation

Tursio’s new prompts became more explicit. The redesigned filter-extraction prompt specified a required output format, formatting rules, square-bracket structure, comma separation, underscore handling, no quoted column names, implicit-column inference, value matching, similar-question reference, and empty-filter behaviour. It also included examples.

That is not prompt poetry. It is interface design.

The old prompt assumed the model could infer both the task and the contract. The new prompt separated the two: what to do, how to format it, what to infer, what not to infer, what to return when nothing applies, and how examples should guide behaviour.
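To show what separating task from contract can look like, here is a rough sketch of a prompt builder. It is not Tursio's prompt: the section headings, the rule wording, and the `build_filter_prompt` function are assumptions written for this example, loosely mirroring the kinds of rules listed above.

```python
def build_filter_prompt(question, columns, examples):
    """Assemble a filter-extraction prompt with the contract spelled out.

    Section titles and rule wording are illustrative assumptions, not the
    wording used in the paper.
    """
    rules = [
        "Return filters as a comma-separated list inside square brackets.",
        "Write column names with underscores and without quotes.",
        "Use only columns from the provided list; infer an implicit column "
        "only when a value clearly matches it.",
        "If no filter applies, return exactly [].",
    ]
    example_lines = [f"Q: {q}\nA: {a}" for q, a in examples]
    sections = [
        "## Task\nExtract the SQL filters from the question.",
        "## Output rules\n" + "\n".join(f"- {r}" for r in rules),
        "## Available columns\n" + ", ".join(columns),
        "## Examples\n" + "\n\n".join(example_lines),
        "## Question\n" + question,
    ]
    return "\n\n".join(sections)


prompt = build_filter_prompt(
    question="customers in EMEA who were late more than 3 times",
    columns=["CUSTOMER_ID", "REGION", "TIMES_LATE"],
    examples=[
        ("customers in APAC", "[REGION = 'APAC']"),
        ("show all customers", "[]"),
    ],
)
print(prompt)
```

The point of building the prompt from named parts is that each part becomes reviewable and diffable on its own, so a migration can change the output rules without touching the examples.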

The paper reports that using the new prompts with GPT-4.1 restored the tests to 100%. That is the strongest practical result in the case study. It shows that the observed drift was not inevitable model failure. It could be corrected by aligning prompts with the newer model’s behavioural expectations.

Still, the result should be read precisely. The paper reports full correction for GPT-4.1 after prompt redesign. It does not present the same restored result for GPT-4.5-preview. It also does not provide a full ablation showing which prompt components mattered most: output format, examples, explicit inference instructions, formatting rules, or their combination. So the safe interpretation is not “make prompts verbose and everything works.” The safe interpretation is: structured prompt redesign can recover reliability when failures come from underspecified application contracts.

A useful internal migration review would classify the paper’s evidence like this:

| Paper component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Regression table across GPT versions | Main evidence | Old prompts did not transfer cleanly across model versions | General superiority or inferiority of any model |
| Failure categories | Diagnostic evidence | Drift centred on implicit interpretation, schema mapping, and output contracts | Complete taxonomy of all LLM migration failures |
| Redesigned GPT-4.1 prompt | Intervention evidence | More explicit structure restored test stability for this case | Which exact prompt element caused the recovery |
| 110-question migration testbed | Implementation contribution | Migration needs broader workload coverage than current regression tests | Fully automated readiness testing |
| Prompt lifecycle figure | Conceptual framework | Prompt maintenance sits between model and application lifecycles | A mature industry standard |

The table is not the whole paper. The operational insight is in the sequence: fail, diagnose, restructure, test, repeat. That sequence is what turns prompt migration from emergency craft into engineering practice.

The migration testbed is the least glamorous and most useful part

The paper’s most valuable artefact may be the migration testbed, precisely because it is boring. Boring is what production reliability looks like when it grows up.

Tursio’s existing regression tests protected current workload stability. But the authors argue that regression tests alone do not prove readiness for a new model. A migration testbed should represent the workload space more broadly, including easy, moderate, and hard cases derived from production patterns.

Tursio built a 110-question migration testbed from hundreds of production workloads. The paper divides it into three levels:

| Complexity level | Construction logic | Example of what it stresses |
| --- | --- | --- |
| Easy | Queries involving operators such as Filter, GroupBy, OrderBy, and Limit, including combinations of one or two operators | Basic operator extraction and formatting |
| Moderate | Semantically similar variations of easy questions, including prior failures and cached/unit-test samples | Robustness to wording variation |
| Hard | Queries with implicit columns, filters, aggregations, ordering, synonym substitutions, and more natural wording | Hidden inference and schema-sensitive interpretation |

This is not just test coverage. It is a model-readiness instrument.

The distinction matters. A standard regression suite asks, “Did we break what already worked?” A migration testbed asks, “Where will this new model reinterpret our workload contract?” Those are related but not identical questions.
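A testbed with the three levels described above could be represented roughly as follows. This is a sketch, not the paper's data model: the `MigrationCase` fields, the `origin` labels, the example questions, and the expected-output strings are all invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    EASY = "easy"
    MODERATE = "moderate"
    HARD = "hard"


@dataclass(frozen=True)
class MigrationCase:
    question: str
    expected: str   # expected operator output, in whatever form the app uses
    level: Level
    origin: str     # hypothetical label, e.g. "production" or "prior_failure"


def coverage(cases):
    """Count cases per complexity level, including levels with zero cases."""
    counts = Counter(c.level for c in cases)
    return {level: counts.get(level, 0) for level in Level}


cases = [
    MigrationCase("top 5 customers by revenue",
                  "OrderBy(revenue desc), Limit(5)", Level.EASY, "production"),
    MigrationCase("which five customers bring in the most revenue",
                  "OrderBy(revenue desc), Limit(5)", Level.MODERATE,
                  "prior_failure"),
    MigrationCase("our biggest accounts lately",
                  "Filter(order_date recent), OrderBy(revenue desc)",
                  Level.HARD, "production"),
]

for level, n in coverage(cases).items():
    print(f"{level.value}: {n}")
```

Reporting zero counts explicitly is a deliberate choice here: a level with no cases is a coverage gap, and it should show up in the summary instead of silently disappearing.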

The paper also admits the cost. Creating meaningful queries requires understanding the underlying dataset. Join-heavy queries were underrepresented beyond existing unit tests because they depend on specific application scenarios. The testbed can be semi-generated, but still needs manual inspection for sanity and meaningfulness.

That limitation is not embarrassing. It is the whole point. Prompt migration is partly a domain-modelling exercise. You cannot fully test whether a natural-language-to-query system preserves meaning unless someone knows what the language and the data are supposed to mean. The machines remain annoyingly dependent on the people who understand the business. A scandal, apparently.

Prompt lifecycle management is the missing operating model

The paper’s framework is simple:

  1. Run the migration testbed on the new model.
  2. Identify failed tasks.
  3. Update the corresponding prompts.
  4. Use the model’s prompting guide to learn migration patterns.
  5. Re-run the migration testbed until it passes.
  6. Run regression tests against current workload stability.
  7. Extend the migration testbed when regression failures reveal uncovered cases.
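
The steps above can be sketched as a loop. Everything named here is a stand-in: `run_case` and `revise_prompt` are hypothetical callables, the prompt revision (steps 3 and 4) is done by people reading the model's prompting guide in practice, and the toy data exists only so the sketch runs.

```python
def migrate_prompts(run_case, revise_prompt, prompts, testbed, max_rounds=5):
    """Run the testbed, revise prompts for failing cases, and re-run.

    run_case(prompts, case) -> bool      : True if the case passes
    revise_prompt(prompts, case) -> dict : returns updated prompts
    Returns (prompts, remaining_failures).
    """
    failures = [c for c in testbed if not run_case(prompts, c)]
    for _ in range(max_rounds):
        if not failures:
            break
        for case in failures:
            prompts = revise_prompt(prompts, case)
        failures = [c for c in testbed if not run_case(prompts, c)]
    return prompts, failures


def extend_testbed(testbed, regression_cases, run_case, prompts):
    """Step 7: add failing regression cases that the testbed did not cover."""
    new = [c for c in regression_cases
           if c not in testbed and not run_case(prompts, c)]
    return testbed + new


# Toy stand-ins: a case passes once the prompts carry an explicit rule for it.
def run_case(prompts, case):
    return case in prompts["explicit_rules"]


def revise_prompt(prompts, case):
    return {"explicit_rules": prompts["explicit_rules"] | {case}}


prompts, failures = migrate_prompts(
    run_case, revise_prompt,
    prompts={"explicit_rules": frozenset({"filter"})},
    testbed=["filter", "order_by", "group_by"],
)
print(sorted(prompts["explicit_rules"]), failures)
```

The `max_rounds` cap is an assumption added for safety. If failures remain after several rounds, that is a signal to stop editing prompts and ask whether the problem is something prompt changes cannot fix.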

This is not exotic. It looks like ordinary software migration discipline, translated into the LLM layer. That translation is overdue.

The authors describe prompt lifecycle as intertwined with model lifecycle and application lifecycle. Application releases create or tune prompts. Model releases force prompt migration. Reliability requirements create regression tests. Prompt migration creates model stabilization. The diagram in the paper is conceptually straightforward, but it captures something many organisations still treat as an awkward side activity: prompts are not separate from software architecture.

The reported effort is also revealing. Tursio’s first migration took a couple of months end to end. With the structured framework, the authors say the process can come down to a couple of weeks. That is not free. It is better than panic.

For business planning, this changes how model upgrades should be budgeted. The cost is not only token pricing, latency, or benchmark improvement. It includes:

| Operating requirement | What changes in practice |
| --- | --- |
| Prompt inventory | Know which prompts exist, which workflows they affect, and which model versions they were tuned against |
| Version control | Treat prompt changes like application changes, with reviewable diffs and rollback paths |
| Migration testbeds | Maintain workload-representative tests beyond narrow regression cases |
| Release gates | Block model upgrades until migration tests and current workload regression tests pass |
| Ownership | Assign responsibility across AI engineers, application engineers, and domain experts |
| Monitoring | Watch not only answer quality, but parser errors, schema mismatches, missing filters, and implicit-operation failures |

The awkward organisational lesson is that prompt management is not simply a skill for “AI people.” It sits between product semantics, application logic, data modelling, and reliability engineering. It needs people who can read a business query, understand the schema, inspect the generated operator tree, and know whether the result is merely plausible or actually correct.

That is a much less glamorous job description than “AI strategist.” It is also more useful.

Better models can remove the helpful bad habits you relied on

The paper’s most interesting conceptual twist is that model improvement can break applications by removing behaviours developers had quietly depended on.

Older GPT-4 behaviour, in this case, seems to have been more forgiving of flexible prompts and more willing to infer implicit meaning. GPT-4.1 and GPT-4.5-preview needed more structured and explicit prompting, although the paper reports the restored 100% result only for GPT-4.1. The paper phrases this as models getting smarter in some ways and “dumber” in others. A less theatrical version: models optimise for different behavioural contracts over time.

For application teams, the problem is not intelligence. It is compatibility.

A model that follows instructions more literally may be better for safety and controllability, but worse for prompts that expected it to infer unstated domain conventions. A model with stronger general reasoning may still produce output that violates a parser. A model with lower hallucination rates may be less willing to infer a filter unless the prompt explicitly tells it when and how to do so.

This is why “we will just switch to the latest model” is not a strategy. It is a superstition with an invoice.

The business correction is straightforward:

| Reader belief | Better replacement belief | Why it matters |
| --- | --- | --- |
| Newer models are drop-in replacements | Newer models are changed runtimes | Compatibility needs testing |
| Prompt engineering is about finding clever wording | Prompt migration is about preserving behaviour across versions | Stability needs process, not vibes |
| Regression tests are enough | Migration tests must probe workload space and implicit semantics | Current tests may miss future drift |
| Small pass-rate drops are tolerable | Failure location matters more than aggregate percentage | Some failures carry disproportionate business risk |
| Prompt verbosity is the answer | Explicit contracts are the answer | Longer prompts can still be wrong if they specify the wrong things |

The practical aim is not to freeze model versions forever. That would be brittle in another direction. The aim is to make upgrades boring: expected, tested, documented, reversible, and owned.

What the paper directly shows, and what Cognaptus infers

The paper directly shows three things.

First, in this Tursio case, old prompts that were stable on GPT-4-32k did not fully transfer to GPT-4.1 or GPT-4.5-preview. The pass rates fell to 98% and 97.3%, respectively.

Second, the failures were interpretable. They involved missing ordering and grouping, incorrect or simplified column names, semantic misreadings, absent filters, output-format problems, parser-hostile messages, and redundant or malformed operations.

Third, structured prompt redesign restored GPT-4.1 testing to 100%, and Tursio’s team developed a 110-question migration testbed and iterative migration workflow to support future changes.

Cognaptus infers the operational lesson: model migration should be treated as a release-management process. This includes prompt inventories, testbeds, explicit output contracts, migration owners, and business-risk classification for failures. The paper does not say every company must copy Tursio’s exact process. It shows enough to justify treating prompt migration as a first-class reliability discipline wherever LLMs sit inside production workflows.

What remains uncertain is the generality. The evidence comes from one enterprise search application, one product architecture, and a small set of GPT-family transitions. The migration testbed was manually inspected and only semi-generated. The paper does not provide a broad benchmark across vendors, domains, prompt types, or application categories. It also does not isolate the causal effect of each prompt redesign component.

So the right reaction is neither dismissal nor overreach. Do not say, “This proves all model upgrades are dangerous.” Also do not say, “This is just one anecdote.” The case is a credible warning from a system where LLM behaviour is deeply embedded in application logic. That is exactly the kind of anecdote operators should respect, because production failures often begin life as “only one case.”

The boundary: this is a case study, not a universal law

The paper is strongest as a practitioner case study. It is weaker as a general empirical benchmark.

Several boundaries matter.

The application domain is enterprise search over structured data. That makes certain failure types especially salient: filters, columns, ordering, grouping, output schemas, and parser compatibility. A creative-writing assistant, legal summarisation tool, coding agent, or customer support bot would have a different failure profile.

The models are from a narrow family. The paper discusses GPT-4-32k, GPT-4.1, and GPT-4.5-preview. It does not test equivalent migrations across Claude, Gemini, Llama, Mistral, DeepSeek, or smaller specialist models. The broader concept probably travels; the exact failure modes may not.

The testbed is partly manual. That is realistic, but it means the process depends on domain expertise and engineering judgement. Automation may accelerate prompt rewriting, but the paper does not yet demonstrate a fully automated migration pipeline.

The restored 100% result applies to GPT-4.1 in the reported setup. It should not be casually extended to all newer models, all prompts, or all workloads. More structured prompts helped here because the observed failures came from underspecification and changed inference behaviour. If failures come from latency limits, tool-use bugs, retrieval defects, missing context, or model incapacity, prompt restructuring alone may not be enough.

Finally, the paper contains implementation lessons such as example-format preferences and XML-style delimiters. These are useful, but they are not presented as controlled ablations. Treat them as field notes, not commandments from Mount Markdown.

What operators should do before the next forced upgrade

The immediate business use of this paper is a readiness checklist.

Start with a prompt inventory. List production prompts, their owning workflow, model dependency, expected output format, downstream parser, business-criticality level, and last validation date. If nobody can produce this list, the organisation does not have prompt management. It has archaeology.
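An inventory like that can start as a very small data structure. The sketch below assumes a hypothetical `PromptRecord` with the fields listed above, made-up prompt names and dates, and an arbitrary 90-day staleness threshold.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PromptRecord:
    prompt_id: str
    workflow: str
    model: str            # model version the prompt was last validated against
    output_format: str
    parser: str           # downstream consumer of the output
    criticality: str      # e.g. "high", "medium", "low"
    last_validated: date


def needs_revalidation(record, target_model, today, max_age_days=90):
    """Flag a prompt validated on another model or validated too long ago."""
    stale = (today - record.last_validated).days > max_age_days
    return record.model != target_model or stale


# Made-up inventory entries, for illustration only.
inventory = [
    PromptRecord("filter_extract", "search", "gpt-4-32k", "bracketed list",
                 "filter_parser", "high", date(2024, 3, 1)),
    PromptRecord("order_extract", "search", "gpt-4.1", "bracketed list",
                 "order_parser", "high", date(2025, 6, 1)),
]

today = date(2025, 7, 1)
to_review = [r.prompt_id for r in inventory
             if needs_revalidation(r, "gpt-4.1", today)]
print(to_review)
```

Even a list this simple answers the first migration question, which is which prompts were never validated against the model you are about to ship.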

Next, separate regression tests from migration tests. Regression tests protect known behaviour. Migration tests should include workload diversity: easy cases, paraphrases, prior failures, implicit operations, edge cases, schema-sensitive language, and natural wording that real users actually type.

Then formalise output contracts. For any LLM output consumed by software, specify structure, allowed values, empty-case behaviour, error handling, and examples. The more downstream automation depends on the response, the less tolerance there should be for charming conversational freelancing.
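A contract becomes enforceable once a parser refuses anything outside it. The following is a minimal sketch of a strict parser for a bracketed filter list. The grammar, the operator set, and the `ContractError` type are assumptions for this example, and the naive comma split would not handle values that themselves contain commas.

```python
import re

# Assumed contract: "[COLUMN op value, COLUMN op value]" or "[]" for no filters.
_FILTER = re.compile(r"^([A-Za-z][A-Za-z0-9_]*)\s*(=|!=|>=|<=|>|<)\s*(.+)$")


class ContractError(ValueError):
    """Raised when model output does not match the assumed output contract."""


def parse_filters(raw):
    """Parse a bracketed, comma-separated filter list into (column, op, value)."""
    text = raw.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ContractError(f"expected a bracketed list, got: {raw!r}")
    body = text[1:-1].strip()
    if not body:
        return []   # explicit empty-filter case
    filters = []
    for part in body.split(","):
        match = _FILTER.match(part.strip())
        if match is None:
            raise ContractError(f"malformed filter: {part.strip()!r}")
        column, op, value = match.groups()
        filters.append((column, op, value.strip()))
    return filters


print(parse_filters("[REGION = 'EMEA', TIMES_LATE > 3]"))
print(parse_filters("[]"))
try:
    parse_filters("Sure! Here are the filters: REGION = 'EMEA'")
except ContractError as err:
    print("rejected:", err)
```

Failing loudly on conversational wrappers or quoted column names turns output-format drift into a countable test failure instead of a silent downstream error.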

Model release monitoring should become part of product operations. Deprecation windows are planning deadlines. A one-year retirement window is generous until an application has hundreds of prompts and sparse ownership. A three-month window is a sprint with legal paperwork.
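The arithmetic behind those windows is simple enough to automate. In this sketch the deprecation dates are the ones cited earlier in this article, while the shutdown dates are derived from its "one year later" and "three months later" wording and should be treated as approximate. The `days_per_prompt` budget is a deliberately crude, hypothetical measure.

```python
from datetime import date

# Deprecation dates as cited in the article; shutdown dates are derived from
# "one year later" and "three months later", so treat them as approximate.
WINDOWS = {
    "gpt-4-32k": (date(2024, 6, 6), date(2025, 6, 6)),
    "gpt-4.5-preview": (date(2025, 4, 14), date(2025, 7, 14)),
}


def days_of_notice(model):
    """Days between the deprecation notice and the shutdown date."""
    deprecated, shutdown = WINDOWS[model]
    return (shutdown - deprecated).days


def days_per_prompt(model, prompt_count):
    """Crude budget: calendar days of notice divided by prompts to port."""
    return days_of_notice(model) / prompt_count


for model in WINDOWS:
    print(model, days_of_notice(model), "days of notice")
```

Dividing the notice period by the number of prompts in the inventory gives a blunt but useful figure for how little calendar time each prompt actually gets.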

Finally, assign migration ownership. AI engineers can evaluate model behaviour. Application engineers can trace downstream failures. Domain experts can judge whether generated operators preserve business meaning. Product owners can decide which failure classes block release. Finance can budget the work instead of discovering it through incident cost. Everyone gets a chair. Nobody gets to call it “just prompt tweaking.”

Conclusion: old prompts are legacy code now

The Tursio paper lands because it gives a name to a problem many teams are already experiencing badly and quietly. Prompt migration is what happens when production LLM applications meet model churn. The old promise was that natural language would make software more flexible. The new reality is that natural language has become part of the software surface, and it needs maintenance like everything else.

The most valuable correction is mental. A prompt is not merely an instruction to a model. In production, it is a behavioural contract between user intent, model interpretation, application logic, data semantics, and downstream execution. When the model changes, that contract must be revalidated.

Better models are still worth adopting. Larger context windows, stronger instruction following, better reasoning, lower hallucination rates, and improved tool use can all matter. But “better” is not the same as “compatible.” The upgrade cycle will not slow down just because your prompts are sentimental.

So port them. Test them. Version them. Make the migration boring.

That, in enterprise AI, is what sophistication looks like when it stops wearing a conference badge.

Cognaptus: Automate the Present, Incubate the Future.


  1. Shivani Tripathi, Pushpanjali Nema, Aditya Halder, Shi Qiao, and Alekh Jindal, “Prompt Migration: Stabilizing GenAI Applications with Evolving Large Language Models,” arXiv:2507.05573, 2025. https://arxiv.org/abs/2507.05573