Merge Without a Mess: Adaptive Model Fusion in the Age of LLM Sprawl

Models pile up quietly.

A customer-support model here. A finance QA model there. A legal drafting variant that nobody wants to delete because it passed last quarter’s evaluation. A sales assistant fine-tuned on a dataset that may or may not still represent how the company sells. Then come LoRA adapters, instruction-tuned checkpoints, safety-tuned variants, regional versions, and a few “temporary” experiments that become permanent because nobody enjoys breaking production on a Friday.

This is how model sprawl happens. Not with a grand architectural decision, but with a directory full of checkpoints and a spreadsheet pretending to be governance.

The paper behind this article, AdaMerging: Adaptive Model Merging for Multi-Task Learning, addresses a deceptively simple question: if an organization already has several task-specialized models derived from the same base model, can it combine them into one multi-task model without retraining on the original training data?¹ The answer is not “just average the weights.” That answer is popular because it is cheap, not because it is consistently intelligent. AdaMerging’s contribution is more precise: it treats the merging coefficients as learnable parameters and tunes them on unlabeled test data through entropy minimization, so the optimization itself decides how much each task vector should contribute.

That sounds technical because it is. But the business idea is quite plain: the value is not merely cheaper deployment. The value is cheaper diagnosis of whether specialized model variants can be consolidated at all.

The real problem is not duplication; it is interference

The old version of this article treated model merging mainly as a way to reduce the number of deployed model artifacts. That is true, but incomplete. Storage is the easy problem. Serving cost is a real problem. Version control is a tedious problem. The hard problem is interference.

Model merging usually begins with a shared base model $\theta_0$ and several fine-tuned variants. Each fine-tuned model can be represented as a task vector:

$$ \tau_i = \theta_i - \theta_0 $$

where $\theta_i$ is the model after fine-tuning on task $i$. Task arithmetic showed that these vectors can be added, subtracted, and combined to steer model behavior.² In the simplest multi-task version, one might sum the task vectors and scale the sum by a single shared coefficient $\lambda$:

$$ \theta_{\text{merged}} = \theta_0 + \lambda \sum_i \tau_i $$

The seductive part is obvious. No retraining from scratch. No original training data. No ensemble latency. One model instead of many.
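The task-vector arithmetic above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's code: the parameter name `layer0.weight`, the toy weights, and the helper names `task_vector` and `task_arithmetic_merge` are all assumptions made for the example.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # parameter name -> flattened weights


def task_vector(finetuned: Params, base: Params) -> Params:
    """tau_i = theta_i - theta_0, computed per named parameter."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def task_arithmetic_merge(base: Params, task_vectors: List[Params], lam: float) -> Params:
    """theta_merged = theta_0 + lam * sum_i tau_i, with one shared coefficient."""
    merged = {}
    for name, base_weights in base.items():
        summed = [
            sum(tv[name][j] for tv in task_vectors)
            for j in range(len(base_weights))
        ]
        merged[name] = [b + lam * s for b, s in zip(base_weights, summed)]
    return merged


# Toy weights (hypothetical values): one layer with two parameters.
theta_0 = {"layer0.weight": [1.0, 2.0]}
theta_a = {"layer0.weight": [1.5, 2.0]}  # fine-tuned on task A
theta_b = {"layer0.weight": [1.0, 1.0]}  # fine-tuned on task B

tau_a = task_vector(theta_a, theta_0)    # {"layer0.weight": [0.5, 0.0]}
tau_b = task_vector(theta_b, theta_0)    # {"layer0.weight": [0.0, -1.0]}
merged = task_arithmetic_merge(theta_0, [tau_a, tau_b], lam=0.5)
print(merged)                            # {'layer0.weight': [1.25, 1.5]}
```

Note that the merge only needs the base weights and the fine-tuned weights. No training data appears anywhere in the computation, which is exactly why the approach is attractive.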

The ugly part is also obvious once production gets involved: task vectors are not polite. A parameter update that improves one task may damage another. Two models may push the same parameter in opposite directions. Some updates may be redundant; others may be important but fragile. TIES-Merging, for example, explicitly frames this as an interference problem, identifying redundant small updates and sign conflicts as major reasons why naïve merging loses useful information.³
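One rough way to see whether two task vectors are "not polite" to each other is to count how often they push the same parameter in opposite directions. The sketch below is a simple diagnostic under stated assumptions, not a procedure taken from either paper: the function name `sign_conflict_rate`, the `eps` threshold, and the toy vectors are hypothetical.

```python
from typing import List


def sign_conflict_rate(tau_a: List[float], tau_b: List[float], eps: float = 1e-8) -> float:
    """Fraction of parameters that both tasks moved, but in opposite directions."""
    both_moved = 0
    conflicts = 0
    for a, b in zip(tau_a, tau_b):
        if abs(a) > eps and abs(b) > eps:   # ignore parameters one task left alone
            both_moved += 1
            if a * b < 0:                   # opposite signs
                conflicts += 1
    return conflicts / both_moved if both_moved else 0.0


# Hypothetical flattened task vectors for two fine-tunes of the same base model.
tau_support = [0.20, -0.10, 0.00, 0.30]
tau_finance = [-0.25, -0.05, 0.40, 0.10]

rate = sign_conflict_rate(tau_support, tau_finance)
print(f"{rate:.2f}")  # 0.33: one of the three jointly updated parameters conflicts
```

A high conflict rate does not prove a merge will fail, and a low one does not prove it will succeed. It is a cheap first signal that the two checkpoints may need more than a shared scalar.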

This is the central misconception worth correcting: model merging is not a compression trick. It is a conflict-resolution problem conducted in weight space, where the participants cannot explain themselves and the mediator is usually a scalar coefficient. Very modern. Very efficient. Slightly alarming.

AdaMerging makes the coefficient the object of learning

AdaMerging starts from the same task-vector logic but changes the key design choice. Instead of assuming a fixed global coefficient, it learns merging coefficients automatically.

There are two main versions.

Version	What it learns	Operational meaning
Task-wise AdaMerging	One coefficient per task vector	Some specialized models should influence the merged model more than others
Layer-wise AdaMerging	One coefficient per task vector per layer	A task may be useful in some representation layers and harmful in others

The second version matters more than it first appears. If a model fine-tuned for satellite-image classification and another fine-tuned for traffic-sign recognition share some visual features but diverge in later classification behavior, a single coefficient is crude. Layer-wise coefficients allow the merge to preserve useful shared representations while reducing harmful task-specific interference.

The paper’s mechanism can be read as a small but important shift. The task-wise form, with one learned coefficient $\alpha_i$ per task vector:

$$ \theta_{\text{merged}} = \theta_0 + \sum_i \alpha_i \tau_i $$

becomes, in the layer-wise case, closer to:

$$ \theta_{\text{merged}}^{(l)} = \theta_0^{(l)} + \sum_i \alpha_i^{(l)} \tau_i^{(l)} $$

where $l$ indexes layers.
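The layer-wise formula can be written out directly. The following is a minimal sketch that assumes each layer is a flat list of floats and that coefficients are already known; the layer names `early` and `late`, the coefficient values, and the function name `layerwise_merge` are illustrative assumptions.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # layer name -> flattened weights


def layerwise_merge(base: Params, task_vectors: List[Params],
                    alphas: List[Dict[str, float]]) -> Params:
    """theta^(l) = theta_0^(l) + sum_i alpha_i^(l) * tau_i^(l)."""
    merged = {}
    for layer, base_weights in base.items():
        out = list(base_weights)
        for tau, alpha in zip(task_vectors, alphas):
            coeff = alpha[layer]            # one coefficient per task per layer
            for j, delta in enumerate(tau[layer]):
                out[j] += coeff * delta
        merged[layer] = out
    return merged


# Hypothetical two-layer model and two task vectors.
theta_0 = {"early": [1.0, 1.0], "late": [0.0, 0.0]}
tau_1 = {"early": [0.5, 0.0], "late": [1.0, -1.0]}
tau_2 = {"early": [0.5, 0.5], "late": [-2.0, 2.0]}

# Both tasks contribute fully in the early layer; in the late layer,
# task 1 is halved and task 2 is switched off.
alphas = [{"early": 1.0, "late": 0.5}, {"early": 1.0, "late": 0.0}]

merged = layerwise_merge(theta_0, [tau_1, tau_2], alphas)
print(merged)  # {'early': [2.0, 1.5], 'late': [0.5, -0.5]}
```

Setting every layer's coefficient for a task to the same value recovers the task-wise form, so the layer-wise version is a strict generalization rather than a different method.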

That is the technical move. The business interpretation is that not every specialization deserves equal voting rights. Some checkpoints are useful donors. Some are noisy donors. Some are useful only in particular layers. If an AI team treats all of them as equally valuable because they all came from the same base model, it has mistaken lineage for compatibility.

Entropy minimization is the clever shortcut, not magic

The awkward constraint in model merging is that original task training data may be unavailable. In enterprises, this is not a rare edge case. Data may be restricted by privacy rules, licensing terms, contractual boundaries, retention policies, or simple organizational forgetfulness. “Where is the training data?” is often the question that turns a model registry into an archaeology project.

AdaMerging avoids relying on original training data by using unlabeled test samples and entropy minimization. The intuition is straightforward: if a model is uncertain across its output distribution, its predictions have high entropy; if it is more decisive, entropy is lower. The paper uses entropy as a surrogate objective for adjusting merging coefficients.
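The entropy intuition is easy to check numerically. This small sketch computes Shannon entropy over a softmax output for two hypothetical sets of logits; the logit values are invented for illustration and do not come from the paper.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; terms with p == 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uncertain = entropy(softmax([0.1, 0.0, -0.1, 0.05]))  # nearly flat distribution
decisive = entropy(softmax([6.0, 0.0, -1.0, 0.5]))    # one dominant class
upper_bound = math.log(4)                             # maximum for four classes

print(decisive < uncertain <= upper_bound)            # True
```

The calculation uses only the model's outputs. That is what makes it usable on unlabeled samples.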

This should not be oversold. Low entropy does not always mean correctness. A model can be confidently wrong, as anyone who has used a chatbot for legal interpretation already knows. The paper’s claim is narrower: in the tested setting, entropy correlates sufficiently with prediction loss to guide coefficient learning. That makes entropy useful as an optimization proxy, not a universal truth serum.

This distinction matters for deployment. In low-risk classification workflows, unlabeled operational samples may be enough to tune a merge candidate. In high-stakes applications, entropy-based tuning should be treated as a screening method before supervised evaluation, not as a replacement for it.
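To make the screening idea concrete, here is a toy version of entropy-guided coefficient selection. It is a sketch under heavy assumptions: the "model" is a two-class linear classifier, the task vectors and unlabeled samples are invented, and an exhaustive grid search stands in for the gradient-based optimization a real implementation would likely use. Only the objective, mean prediction entropy on unlabeled inputs, follows the idea described above.

```python
import itertools
import math
from typing import List, Sequence

Matrix = List[List[float]]  # rows = classes, columns = input features


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def merge(base: Matrix, taus: Sequence[Matrix], alphas: Sequence[float]) -> Matrix:
    """theta_0 + sum_i alpha_i * tau_i for a single weight matrix."""
    return [
        [base[r][c] + sum(a * tau[r][c] for a, tau in zip(alphas, taus))
         for c in range(len(base[r]))]
        for r in range(len(base))
    ]


def mean_entropy(weights: Matrix, samples: Sequence[Sequence[float]]) -> float:
    """Average prediction entropy of a linear classifier; no labels involved."""
    total = 0.0
    for x in samples:
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
        total += entropy(softmax(logits))
    return total / len(samples)


def search_coefficients(base, taus, samples, grid):
    """Exhaustive grid search (a stand-in for gradient-based updates)."""
    best_alphas, best_score = None, float("inf")
    for alphas in itertools.product(grid, repeat=len(taus)):
        score = mean_entropy(merge(base, taus, alphas), samples)
        if score < best_score:
            best_alphas, best_score = alphas, score
    return best_alphas, best_score


# Hypothetical setup: two task vectors that disagree about the first weight.
base = [[0.1, 0.0], [0.0, 0.1]]
tau_1 = [[2.0, 0.0], [0.0, 0.0]]
tau_2 = [[-1.5, 0.0], [0.0, 2.0]]
unlabeled = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
grid = [0.0, 0.25, 0.5, 0.75, 1.0]

uniform_score = mean_entropy(merge(base, [tau_1, tau_2], (0.5, 0.5)), unlabeled)
best_alphas, best_score = search_coefficients(base, [tau_1, tau_2], unlabeled, grid)
print(best_alphas, round(best_score, 4), round(uniform_score, 4))
```

Because the uniform setting is one of the grid points, the selected coefficients can never score worse on entropy than equal weighting. Whether they score better on accuracy is a separate question, which is the reason supervised evaluation still has to follow.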

The main result is coefficient discipline, not model alchemy

The evidence in AdaMerging is strongest when read against the right baselines. The paper evaluates multi-task model merging on eight image-classification datasets using CLIP-based vision models. The main ViT-B/32 table compares separate fine-tuned models, traditional multi-task learning, simple weight averaging, Fisher merging, RegMean, Task Arithmetic, TIES-Merging, and AdaMerging variants.

The average accuracy results are the useful part:

Method	Average accuracy on eight tasks (%)
Separate individual fine-tuned models	90.5
Traditional multi-task learning	88.9
Weight averaging	65.8
Fisher merging	68.3
RegMean	71.8
Task Arithmetic	69.1
TIES-Merging	72.4
Task-wise AdaMerging	71.1
Task-wise AdaMerging++	73.7
Layer-wise AdaMerging	80.1
Layer-wise AdaMerging++	81.1

The important comparison is not against the individual models. Separate task-specific models still win because they do not suffer cross-task interference. Traditional multi-task learning also remains stronger, but it requires access to the task data and joint training.

The more meaningful comparison is against other data-free merging methods. Layer-wise AdaMerging raises average accuracy from 69.1 under Task Arithmetic to 80.1, a gain of 11.0 points. Layer-wise AdaMerging++, which applies the same coefficient learning on top of TIES-Merging, raises average accuracy from 72.4 to 81.1, a gain of 8.7 points. That is not a rounding error. It says the coefficient problem is not cosmetic; it is central.
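The gaps are worth computing explicitly, because they show both how much layer-wise adaptation recovers and how much is still left on the table. The numbers below are copied from the table above; the helper name `gap` is just for this example.

```python
# Average accuracy (%) on eight tasks, copied from the table above.
avg_accuracy = {
    "Task Arithmetic": 69.1,
    "TIES-Merging": 72.4,
    "Layer-wise AdaMerging": 80.1,
    "Layer-wise AdaMerging++": 81.1,
    "Traditional multi-task learning": 88.9,
    "Separate individual fine-tuned models": 90.5,
}


def gap(higher: str, lower: str) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(avg_accuracy[higher] - avg_accuracy[lower], 1)


gain_over_task_arithmetic = gap("Layer-wise AdaMerging", "Task Arithmetic")
gain_over_ties = gap("Layer-wise AdaMerging++", "TIES-Merging")
gap_to_multitask = gap("Traditional multi-task learning", "Layer-wise AdaMerging++")
gap_to_individual = gap("Separate individual fine-tuned models", "Layer-wise AdaMerging++")

print(gain_over_task_arithmetic)  # 11.0
print(gain_over_ties)             # 8.7
print(gap_to_multitask)           # 7.8
print(gap_to_individual)          # 9.4
```

Read together, the best merged model closes a large part of the distance to joint training but still trails separate models by 9.4 points on average.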

The result also clarifies where the benefit comes from. Task-wise coefficients help, but the large jump comes from layer-wise adaptation. In other words, the merge improves when the method stops asking, “How much should this task matter?” and starts asking, “Where in the network should this task matter?”

That is the most demanding part of the argument to follow, but it is also the practical insight. Model fusion is not a referendum among models. It is a layer-sensitive allocation problem.

The appendix tests cost and robustness, not a second thesis

A good paper appendix often prevents bad business interpretation. AdaMerging’s appendix does that.

First, the parameter overhead is tiny relative to the model updates being merged. For ViT-B/32, the paper reports 907,589,640 parameters across the eight task vectors. Task-wise AdaMerging adds only 8 trainable merging coefficients. Layer-wise AdaMerging adds 1,248 coefficients. So the method is not secretly training another large model while wearing a “data-free” costume.
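A quick calculation puts the overhead in proportion. The figures are the ones quoted above; the variable names are illustrative.

```python
task_vector_params = 907_589_640  # total across the eight task vectors, as reported
num_tasks = 8
taskwise_coeffs = 8               # one coefficient per task vector
layerwise_coeffs = 1_248          # one coefficient per task vector per layer

params_per_task_vector = task_vector_params // num_tasks
coeffs_per_task_vector = layerwise_coeffs // num_tasks
layerwise_overhead = layerwise_coeffs / task_vector_params

print(params_per_task_vector)        # 113448705
print(coeffs_per_task_vector)        # 156
print(f"{layerwise_overhead:.2e}")   # 1.38e-06
```

Even in the layer-wise case, the learned coefficients amount to roughly 1.4 millionths of the parameters being merged.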

Second, the time-cost analysis shows a useful trade-off. On a single GeForce RTX 3090, the paper reports that Layer-wise AdaMerging lifts average accuracy from a starting point of 69.1 (the same figure as the Task Arithmetic row in the table above) to 71.1 after 7.5 minutes of coefficient training, and to 77.1 after 50 minutes. Longer optimization reaches higher scores, but the early gains are already operationally meaningful.

Third, the robustness tests matter because they shift the interpretation from “better benchmark score” to “less fragile under distribution shift.” On ViT-B/16 corruption tests, AdaMerging improves average accuracy over Task Arithmetic under several corrupted conditions, including motion blur, impulse noise, Gaussian noise, pixelation, spatter, contrast, and JPEG compression. The reported improvements range from 6.8 to 12.4 percentage points across those corruption types.

This does not prove that AdaMerging will make enterprise LLMs robust against messy user prompts, adversarial jailbreaks, or domain drift. It does show that adaptive coefficients can help under controlled distribution shifts. That is a useful result. It is not a magical insurance policy. Those are usually sold separately, with worse documentation.

Model fusion changes the operating model for AI teams

For business use, the direct lesson is not “merge everything.” The lesson is to treat fine-tuned models as a portfolio of reusable capability deltas.

A practical model operations team could use the following workflow:

Step	Question	Business value	Boundary
Inventory	Which checkpoints share the same base model and architecture?	Identifies merge candidates	Different architectures require other alignment methods
Compatibility test	Do task vectors interfere or reinforce each other?	Avoids destructive consolidation	Requires evaluation beyond headline accuracy
Adaptive merge	Can coefficients be learned from unlabeled or lightly labeled samples?	Reduces dependence on original training data	Entropy is only a proxy
Validation	Does the merged model preserve critical task behavior?	Protects production reliability	Needs task-specific acceptance tests
Governance	Can the merged artifact be documented and rolled back?	Simplifies audit and deployment	Provenance risks remain if source models are poorly tracked
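The inventory step in the table is the easiest to automate. The sketch below groups registry entries by base model and architecture and keeps only groups with at least two checkpoints. Everything in it is hypothetical: the `Checkpoint` fields, the model names, and the registry contents are invented for illustration and do not refer to any real registry API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Checkpoint:
    name: str
    base_model: str
    architecture: str


def merge_candidates(registry: List[Checkpoint]) -> Dict[Tuple[str, str], List[str]]:
    """Group checkpoints that share a base model and architecture."""
    groups = defaultdict(list)
    for ckpt in registry:
        groups[(ckpt.base_model, ckpt.architecture)].append(ckpt.name)
    # A group of one has nothing to merge with.
    return {key: names for key, names in groups.items() if len(names) >= 2}


# Hypothetical registry entries.
registry = [
    Checkpoint("support-v3", "base-7b", "decoder"),
    Checkpoint("finance-qa-v1", "base-7b", "decoder"),
    Checkpoint("legal-draft-v2", "base-7b", "decoder"),
    Checkpoint("sales-assist-v1", "base-13b", "decoder"),
]

candidates = merge_candidates(registry)
print(candidates)
# {('base-7b', 'decoder'): ['support-v3', 'finance-qa-v1', 'legal-draft-v2']}
```

Sharing a base model makes a group eligible for the compatibility test. It does not make the merge safe, which is why the remaining rows of the table exist.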

This is where the ROI case becomes concrete. The savings do not come only from serving one model instead of five. They come from changing how experimentation is organized.

Without merging, every successful fine-tune becomes a new long-term liability: host it, monitor it, evaluate it, govern it, and explain it during the next audit. With disciplined merging, specialized checkpoints can become temporary research artifacts or reusable capability components. The organization can experiment modularly, then consolidate selectively.

That is a better operating model than “please do not touch the old checkpoint because nobody remembers why it works.”

The LLM boundary is sharper than the slogan

The title of this article mentions LLM sprawl, so the boundary must be stated carefully. AdaMerging’s main experiments are not a direct enterprise LLM deployment study. They focus on CLIP-style vision models and multi-task image classification. The paper is relevant to LLM operations because the same organizational problem exists: many fine-tuned variants, high serving costs, and pressure to reuse specialized capabilities. But relevance is not equivalence.

Later work on in-the-wild LLM merging is more cautious. A 2025 systematic study evaluates six model-merging methods across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen benchmarks. It finds that Task Arithmetic is the only method among those tested that reliably yields gains in that heterogeneous LLM setting, while several more elaborate interference-aware or subspace methods often fail to improve over the base model.⁴

That finding should not be read as a rejection of AdaMerging. It should be read as a warning against lazy transfer. Merging models trained on cleanly separated tasks is easier than merging arbitrary community checkpoints or enterprise variants trained on overlapping, conflicting, or poorly documented objectives. In the wild, “specialized” often means “contaminated in different ways.”

For business teams, this creates a useful rule: the more controlled the fine-tuning pipeline, the more plausible adaptive merging becomes. The messier the checkpoint provenance, the more merging becomes an evaluation project rather than an infrastructure shortcut.

Where adaptive fusion fits beside LoRA and model soups

Adaptive merging also needs to be placed correctly among neighboring techniques.

Model soups showed that averaging weights from multiple fine-tuned models can improve accuracy and robustness without increasing inference-time cost, especially when the models lie in a compatible region of weight space.⁵ Task Arithmetic made the idea more compositional by treating fine-tuning changes as vectors that can be added or subtracted. TIES-Merging then focused on interference, trimming small updates and resolving sign disagreements. LoraHub moved the discussion toward composing lightweight adaptation modules for cross-task generalization with only a few examples from a new task.⁶
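The difference between plain averaging and interference-aware merging shows up even on four numbers. The following is a simplified sketch in the spirit of the trim-and-resolve idea described above, not the reference TIES-Merging implementation: the keep count, the toy task vectors, and the function names are assumptions.

```python
from typing import List


def trim(tau: List[float], keep: int) -> List[float]:
    """Keep the `keep` largest-magnitude updates; zero out the rest."""
    order = sorted(range(len(tau)), key=lambda j: abs(tau[j]), reverse=True)
    kept = set(order[:keep])
    return [v if j in kept else 0.0 for j, v in enumerate(tau)]


def resolve_and_merge(taus: List[List[float]]) -> List[float]:
    """Per parameter: pick the dominant sign, then average only agreeing updates."""
    merged = []
    for values in zip(*taus):
        total = sum(values)
        if total == 0.0:
            merged.append(0.0)
            continue
        agreeing = [v for v in values if v != 0.0 and (v > 0) == (total > 0)]
        merged.append(sum(agreeing) / len(agreeing))
    return merged


def plain_average(taus: List[List[float]]) -> List[float]:
    return [sum(values) / len(values) for values in zip(*taus)]


# Hypothetical task vectors: they conflict on the first parameter.
tau_a = [0.8, -0.05, 0.3, 0.0]
tau_b = [-0.5, 0.1, 0.4, 0.02]

trimmed = [trim(tau_a, keep=2), trim(tau_b, keep=2)]
resolved = resolve_and_merge(trimmed)
averaged = plain_average([tau_a, tau_b])

print(resolved[0])              # 0.8: the dominant direction survives intact
print(round(averaged[0], 2))    # 0.15: the conflict mostly cancels out
```

Both approaches still rely on a fixed rule for how much each task counts. That is the part AdaMerging replaces with learned coefficients.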

AdaMerging belongs in this lineage, but its distinctive contribution is coefficient learning without original labels. It is not the broadest model-composition framework, and it is not the final answer to LLM consolidation. Its value is narrower and more operationally interesting: it shows that the coefficient-selection step can be automated and made more granular, especially at the layer level.

That matters because in real model operations, the dull steps are the expensive ones. Choosing coefficients by grid search does not scale well as tasks and layers increase. Hand-tuning merge weights is not strategy; it is artisanal spreadsheet suffering. Adaptive coefficient learning is attractive precisely because it replaces manual guesswork with an optimization procedure.
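The scaling problem with grid search is plain arithmetic. Assuming ten candidate values per coefficient (an arbitrary choice for illustration) and the coefficient counts reported for ViT-B/32, the number of combinations is:

```python
def grid_search_size(num_coefficients: int, grid_points: int) -> int:
    """Number of coefficient combinations an exhaustive grid search must evaluate."""
    return grid_points ** num_coefficients


# Ten candidate values per coefficient is an assumption made for this example.
taskwise = grid_search_size(num_coefficients=8, grid_points=10)
layerwise = grid_search_size(num_coefficients=1_248, grid_points=10)

print(taskwise)             # 100000000
print(len(str(layerwise)))  # 1249 digits
```

A hundred million merges, each needing evaluation, is already impractical. The layer-wise count is not a number anyone will enumerate, which is why a learned procedure is a requirement there and not merely a convenience.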

What the paper directly shows, and what Cognaptus infers

The paper directly shows that adaptive coefficient learning can substantially improve data-free multi-task model merging in the evaluated CLIP-based image-classification setting. It also shows that layer-wise coefficients outperform cruder task-wise weighting, and that the approach can improve robustness under certain distribution shifts.

Cognaptus infers that the same design principle is useful for enterprise AI architecture: treat fine-tuned variants as composable capability deltas, but evaluate compatibility before consolidation. In this reading, adaptive fusion is less about producing one grand unified model and more about building a disciplined model lifecycle: specialize, test, merge, validate, document, and retire redundant artifacts.

What remains uncertain is the degree to which AdaMerging-style coefficient learning transfers cleanly to large language models with messy instruction tuning, safety alignment, domain-specific data, and overlapping objectives. That uncertainty is not a footnote. It is the difference between a promising internal platform capability and a production outage with a postmortem titled “The Model Was Confident During Testing.”

The business value is controlled consolidation

The useful future of model fusion is not a fantasy in which every enterprise maintains one perfect model. Specialization will not disappear. New domains, new workflows, new regulations, and new customer behaviors will keep producing new adaptations.

The point is to stop treating every adaptation as a permanent deployment unit.

Adaptive model merging suggests a more mature pattern: fine-tune locally, merge selectively, evaluate jointly, and deploy only when the consolidated model preserves the behaviors that matter. That turns model sprawl from an uncontrolled inventory problem into a portfolio-management problem.
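"Deploy only when the consolidated model preserves the behaviors that matter" can be turned into an explicit gate. The sketch below is one possible shape for such a check, with invented task names, accuracies, and tolerances; a real acceptance test would cover far more than a single accuracy number per task.

```python
from typing import Dict, List


def failing_tasks(individual: Dict[str, float], merged: Dict[str, float],
                  max_drop: Dict[str, float]) -> List[str]:
    """Tasks where the merged model loses more accuracy than the allowed drop."""
    failures = []
    for task, reference in individual.items():
        # A task missing from the merged results counts as a score of 0.0.
        drop = reference - merged.get(task, 0.0)
        if drop > max_drop[task]:
            failures.append(task)
    return sorted(failures)


# Hypothetical evaluation results on labeled acceptance sets.
individual = {"support": 0.92, "finance_qa": 0.88, "legal": 0.90}
merged = {"support": 0.90, "finance_qa": 0.87, "legal": 0.81}
max_drop = {"support": 0.03, "finance_qa": 0.03, "legal": 0.02}

blocked = failing_tasks(individual, merged, max_drop)
deploy = not blocked

print(blocked)  # ['legal']
print(deploy)   # False
```

Per-task tolerances matter here. An average-accuracy threshold would let a merge pass while quietly sacrificing the one task with the least room for error.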

The old AI infrastructure question was, “How do we train a better model?” The newer question is less glamorous and more useful: “How do we reuse the models we already paid for without making them worse?”

AdaMerging does not fully answer that question for every LLM setting. But it gives the right shape of the answer: merging is not arithmetic convenience. It is adaptive allocation under interference.

And in an industry still addicted to scaling curves, there is something almost subversive about that. Sometimes the clever move is not to build a larger model. Sometimes it is to stop letting five smaller messes pretend they are a strategy.

Cognaptus: Automate the Present, Incubate the Future.

1. Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao, “AdaMerging: Adaptive Model Merging for Multi-Task Learning,” arXiv:2310.02575, 2023. https://arxiv.org/abs/2310.02575
2. Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi, “Editing Models with Task Arithmetic,” arXiv:2212.04089, 2022. https://arxiv.org/abs/2212.04089
3. Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal, “TIES-Merging: Resolving Interference When Merging Models,” arXiv:2306.01708, 2023. https://arxiv.org/abs/2306.01708
4. Oğuz Kağan Hitit, Leander Girrbach, and Zeynep Akata, “A Systematic Study of In-the-Wild Model Merging for Large Language Models,” arXiv:2511.21437, 2025. https://arxiv.org/abs/2511.21437
5. Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt, “Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time,” arXiv:2203.05482, 2022. https://arxiv.org/abs/2203.05482
6. Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin, “LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition,” arXiv:2307.13269, 2023. https://arxiv.org/abs/2307.13269
