Merge, Bound, and Determined: Why Weight-Space Surgery May Be CIL’s Most Underrated Trick

Catalogs change. Defect categories change. Fraud patterns change. Document types change. The model, unfortunately, often reacts like an employee who learns the new product line and immediately forgets where the old shelves are.

That is the everyday problem behind Class-Incremental Learning (CIL): a model must learn new classes over time while still recognizing old ones. The difficult part is not merely adding output labels. It is keeping the feature extractor from being rewritten by the latest task until yesterday’s knowledge becomes decorative archaeology.

The paper “Merge and Bound: Direct Manipulations on Weights for Class Incremental Learning” proposes a refreshingly direct answer: stop treating continual learning only as a loss-design problem, and start managing the model’s weight trajectory itself.¹ The method, Merge-and-Bound (M&B), combines three operations: inter-task weight merging, intra-task weight merging, and bounded model updates. None of these requires a new architecture. None changes the original learning objective. The trick is in how the training path is controlled.

This matters because the tempting summary—“they average model weights”—is too shallow. Generic model averaging is not the point. M&B is closer to disciplined weight-space governance: consolidate old tasks, smooth current-task learning, and prevent the model from drifting too far from a base that still remembers the past. Yes, it is less glamorous than a new transformer variant. That is probably why it is useful.

The real problem is not adding classes; it is controlling drift

In CIL, tasks arrive sequentially. Each task introduces a disjoint set of classes. The model is eventually tested on all classes seen so far, not only the latest batch. That turns every update into a negotiation among three forces:

Force	What the model needs	What can go wrong
Stability	Preserve old-task knowledge	The model becomes too rigid and fails to learn new classes
Plasticity	Adapt to new classes	The model overwrites old representations and forgets
Operational efficiency	Avoid heavy retraining, large replay buffers, or architectural growth	The solution becomes too expensive or too complicated to deploy repeatedly

Most CIL methods attack this with distillation, rehearsal, architecture expansion, bias correction, or parameter regularization. These are valid strategies, but each brings friction. Distillation and rehearsal often depend on access to past examples or proxy signals. Architecture expansion increases system complexity. Regularization can be too weak in harder incremental settings.

M&B starts from a different observation: when continual learning fails, the failure is visible in weight space. Each new task pulls the model in a new direction. If those directions are inconsistent, later updates damage earlier knowledge. If they are better aligned, the model has a chance to accumulate competence instead of repeatedly renovating the basement while the roof is on fire.

So the paper’s mechanism-first logic is simple:

1. Build a base model that carries old knowledge.
2. Train the current task without letting the model wander too far from that base.
3. Average useful checkpoints so the final model is not just the last, most task-biased point on the path.
4. Repeat.
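The loop above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the authors' implementation: weights are flat lists, `train_step` is a hypothetical stand-in that nudges the weights toward a task-specific optimum, the schedule parameters `bound_every` and `avg_every` are illustrative names, and applying the bound in the very first stage is a guess the article does not confirm.

```python
import math

def train_step(w, target, lr=0.2):
    # Hypothetical stand-in for one epoch of training on the current task.
    return [wi + lr * (ti - wi) for wi, ti in zip(w, target)]

def clip_to_ball(w, base, bound):
    # Keep w within an L2 ball of radius `bound` centred on the base weights.
    delta = [wi - bi for wi, bi in zip(w, base)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= bound:
        return w
    return [bi + bound * d / norm for bi, d in zip(base, delta)]

def run_stream(task_targets, init, bound=1.0, epochs=10, bound_every=2, avg_every=2):
    base = list(init)
    for k, target in enumerate(task_targets, start=1):
        w = list(base)                      # start the stage from the base model
        avg, n = list(w), 0                 # running average of checkpoints
        for epoch in range(1, epochs + 1):
            w = train_step(w, target)
            if epoch % bound_every == 0:    # bounded update
                w = clip_to_ball(w, base, bound)
            if epoch % avg_every == 0:      # intra-task merging
                avg = [(n * a + wi) / (n + 1) for a, wi in zip(avg, w)]
                n += 1
        theta_k = avg                       # averaged model replaces the last checkpoint
        # inter-task merging: fold theta_k into the running mean of stage models
        base = [((k - 1) * b + t) / k for b, t in zip(base, theta_k)]
    return base

final_base = run_stream([[3.0, 0.0], [0.0, 3.0]], init=[0.0, 0.0], bound=1.0)
print(final_base)
```

The point of the sketch is the ordering: the bound is measured against the base, checkpoints are averaged after they have been pulled back inside the ball, and only the averaged model feeds the next base.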

The elegance is not that each operation is individually exotic. It is that the operations form a control system.

Inter-task merging turns past models into the next starting point

The first component, inter-task weight merging, consolidates models across tasks. After task $k$, M&B constructs the next base feature extractor by recursively averaging the previous base weights and the current task’s feature extractor:

$$ \theta^{base}_{k+1} = \frac{k-1}{k}\theta^{base}_{k} + \frac{1}{k}\theta_k $$

Operationally, this means the next task does not start from the most recent model alone. It starts from a moving average of all previous feature extractors. That matters because the most recent model is likely biased toward the most recent classes. A base model formed by averaging across stages should, in principle, carry a less volatile representation of accumulated knowledge.
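A short check makes the "moving average of all previous feature extractors" claim concrete. The sketch below applies the recursive rule to four hypothetical weight vectors (the numbers are invented for illustration) and compares the result with a plain mean; `merge_inter_task` is an illustrative name, not something from the paper.

```python
def merge_inter_task(base, theta_k, k):
    """base_{k+1} = (k-1)/k * base_k + 1/k * theta_k, applied element-wise."""
    return [((k - 1) * b + t) / k for b, t in zip(base, theta_k)]

# Hypothetical feature-extractor weights after tasks 1..4 (two parameters each).
thetas = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]

base = [0.0, 0.0]  # placeholder: it is multiplied by zero when k = 1
for k, theta in enumerate(thetas, start=1):
    base = merge_inter_task(base, theta, k)

uniform_mean = [sum(column) / len(thetas) for column in zip(*thetas)]
print(base, uniform_mean)
```

Both lists come out identical, which is the whole design choice: every past stage keeps an equal vote in the next starting point, no matter how long ago it was trained.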

The classifier is handled differently. Since each new task introduces new classes, M&B concatenates classifier weights for the new classes onto the existing classifier. The feature extractor is averaged; the classifier grows by adding the relevant class weights. This split is sensible. Representations should be stabilized across tasks; class heads must still make room for new labels.
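As a minimal sketch of that split, assuming a linear head stored as one weight row per class: old rows stay where they are and new rows are appended. How the new rows are initialised or trained is not covered in this article, so the values below are placeholders, and `extend_classifier` and `logits` are hypothetical helper names.

```python
def extend_classifier(head, new_class_rows):
    """Append one weight row per new class; existing rows are left untouched."""
    return head + [list(row) for row in new_class_rows]

def logits(head, features):
    """One score per known class for a single feature vector."""
    return [sum(w * f for w, f in zip(row, features)) for row in head]

old_head = [[0.2, -0.1], [0.5, 0.3]]   # two old classes, feature dimension 2
new_rows = [[-0.4, 0.9]]               # one new class (placeholder weights)

head = extend_classifier(old_head, new_rows)
print(len(head), logits(head, [1.0, 1.0]))
```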

For business readers, the practical analogy is version management. A naive fine-tuning pipeline says: “Take the newest model and keep going.” M&B says: “Before the next update, consolidate the institutional memory of previous versions.” That is a different operating philosophy. It treats previous models not as discarded checkpoints but as reusable evidence about where useful representations live.

Intra-task merging keeps the current task from becoming too noisy

The second component, intra-task weight merging, works inside one incremental stage. During training on the current task, M&B averages multiple checkpoints along the training trajectory:

$$ \Theta^{avg}_k \leftarrow \frac{n\Theta^{avg}_k + \Theta_k}{n+1} $$

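Here $n$ counts the checkpoints already folded into the running average, so every checkpoint ends up with equal weight. At the end of the stage, the final model is replaced by this intra-task averaged model.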
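The update rule is an incremental mean, which a few hypothetical checkpoints make easy to verify. In this sketch `n` is the number of checkpoints already absorbed into the average, and the weight values are invented.

```python
def update_running_average(avg, checkpoint, n):
    """avg <- (n * avg + checkpoint) / (n + 1), applied element-wise."""
    return [(n * a + c) / (n + 1) for a, c in zip(avg, checkpoint)]

# Hypothetical checkpoints saved along one stage's training trajectory.
checkpoints = [[1.0, 4.0], [2.0, 2.0], [6.0, 0.0]]

avg, n = [0.0, 0.0], 0   # the initial value is irrelevant because n starts at 0
for checkpoint in checkpoints:
    avg = update_running_average(avg, checkpoint, n)
    n += 1

print(avg)   # equals the plain mean of the three checkpoints
```

Only one extra copy of the weights is kept in memory, however many checkpoints are averaged.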

This is the plasticity side of the design. If inter-task merging is about preserving the past, intra-task merging is about making current-task learning less brittle. A final checkpoint can reflect the noise and bias of late-stage training. Averaging checkpoints along the path can improve generalization, much like stochastic weight averaging in other settings. But in CIL, the context is harsher: previous classes are underrepresented or absent, so the latest task can dominate the statistics.

That is why the paper’s BatchNorm detail is not a footnote for implementation nerds. After model averaging, BatchNorm running statistics may no longer match the merged weights. The authors test different strategies and find that resetting BatchNorm statistics severely damages performance. On CIFAR-100 with 50 incremental stages, resetting the running statistics drops overall accuracy to 25.60, compared with 65.38 for the selected strategy. No, that is not a rounding error; that is the model politely falling down the stairs.

The reason is intuitive. If BatchNorm statistics are recomputed after a reset using data from the current task, they become biased toward current classes. In CIL, where the model must still perform on previous classes, that bias is toxic. The authors instead forward data on top of the current model’s existing running statistics, making only slight updates. This keeps the merged model better aligned with the broader class history.
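A one-dimensional toy shows why the two strategies diverge. It assumes a PyTorch-style exponential update for running statistics with a small momentum; the article does not give the authors' exact update rule, and every number here is hypothetical.

```python
def nudge_running_mean(running_mean, batch_means, momentum):
    """Exponential update on top of the existing statistic (PyTorch-style momentum)."""
    for batch_mean in batch_means:
        running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    return running_mean

def reset_then_recompute(batch_means):
    """Discard the old statistic and average only what the current task shows."""
    return sum(batch_means) / len(batch_means)

history_mean = 0.0                   # hypothetical statistic shaped by earlier classes
current_batches = [2.0, 2.0, 2.0]    # hypothetical batch means from current-task data

kept = nudge_running_mean(history_mean, current_batches, momentum=0.01)
reset = reset_then_recompute(current_batches)
print(f"nudged: {kept:.4f}  reset: {reset:.4f}")
```

With a short pass and a small momentum, the nudged statistic stays close to its history, while the reset version jumps straight to the current task's value. A long enough pass would eventually drag the nudged statistic there too, so the length of that extra forward pass is part of the design.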

So the misconception to avoid is this: M&B is not “just average the weights and enjoy enlightenment.” The weight averaging only works because the method also manages when to average, what to average, and how to keep normalization statistics from betraying the old classes.

Bounded updates act like a trust region for memory

The third component, bounded model update, constrains how far the current model may move from the base model. Every $e_b$ epochs, the displacement $\Delta \Theta$ from the base model is checked: if its norm exceeds a threshold $B$, it is rescaled back to norm $B$; otherwise it is left alone:

$$ \Delta \Theta \leftarrow \begin{cases} B \cdot \frac{\Delta \Theta}{\|\Delta \Theta\|}, & \text{if } \|\Delta \Theta\| > B \\ \Delta \Theta, & \text{otherwise} \end{cases} $$

This is the “bound” in Merge-and-Bound. The model can explore, but only within a ball around the base model. The base model is assumed to contain previous-task knowledge, so staying near it reduces destructive drift.
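The projection is small enough to write out. This sketch treats all parameters as one flat vector and uses the L2 norm; whether the paper applies the bound globally or per layer is not stated in this article, so that choice is an assumption, as is the helper name `bound_update`.

```python
import math

def bound_update(weights, base, bound):
    """Project weights back into an L2 ball of radius `bound` around the base."""
    delta = [w - b for w, b in zip(weights, base)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm > bound:
        delta = [bound * d / norm for d in delta]   # keep direction, shrink length
    return [b + d for b, d in zip(base, delta)]

base = [1.0, 1.0]
drifted = [4.0, 5.0]   # displacement (3, 4) has norm 5

print(bound_update(drifted, base, bound=2.5))    # pulled back: [2.5, 3.0]
print(bound_update(drifted, base, bound=10.0))   # already inside: [4.0, 5.0]
```

The direction of travel survives the clip; only the distance is capped, which is what makes the analogy to a trust region reasonable.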

This mechanism also explains why M&B pairs naturally with weight merging. Model averaging tends to work better when the models being averaged live in compatible regions of the loss landscape. If each task update sends the model into a different basin, averaging can produce a mushy compromise. Bounding the update makes it more likely that successive models remain close enough for merging to be meaningful.

That is the central design pattern: merging needs proximity; bounding enforces proximity; intra-task averaging restores some adaptability inside that constraint.

The main results show consistent gains, especially when incremental learning becomes harder

The paper evaluates M&B by plugging it into existing CIL methods rather than presenting it as a standalone architecture. This is important. The claimed value is not “our new model beats your old model because it is larger.” The claim is “this training technique improves several existing approaches without changing their architecture or loss.”

The authors test M&B with methods including PODNet, AFC, FOSTER, and IL2A across CIFAR-100 and ImageNet-100/1000. The metric is average incremental accuracy: after each stage the model is tested on all classes seen so far, and the per-stage accuracies are averaged.
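For readers new to the metric, here is a minimal sketch of average incremental accuracy with invented per-stage numbers. Protocols differ on details such as whether the initial stage is counted, so treat the function as the general idea rather than the paper's exact evaluation script.

```python
def average_incremental_accuracy(stage_accuracies):
    """Mean of the accuracies measured on all seen classes after each stage."""
    return sum(stage_accuracies) / len(stage_accuracies)

# Hypothetical accuracies (%) on all classes seen so far, one value per stage.
stage_accuracies = [80.0, 70.0, 66.0, 64.0]

aia = average_incremental_accuracy(stage_accuracies)
print(aia)   # 70.0
```

A model that is excellent early and poor late can still post a respectable average, which is why the forgetting numbers in the ablations deserve separate attention.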

A few results are worth reading carefully:

Benchmark setting	Base method	Baseline accuracy (%)	With M&B (%)	Practical interpretation
CIFAR-100, 50 tasks	PODNet	58.37	63.29	Stronger performance when many small increments increase forgetting pressure
CIFAR-100, 50 tasks	AFC	61.94	65.38	Gains persist on a strong distillation-based method
CIFAR-100, 50 tasks	FOSTER	59.60	62.47	Even architecture-expansion methods benefit when task count becomes high
ImageNet-1000, 10 tasks	PODNet	65.58	67.76	Improvement transfers to a larger benchmark
ImageNet-1000, 10 tasks	AFC	66.39	69.51	Larger-scale setting shows meaningful gains

The pattern matters more than any single number. M&B improves performance across methods and datasets, and the gains tend to become more valuable as the number of tasks rises. That is exactly where CIL becomes operationally painful: many small updates, limited past data, and no guarantee that the stream of new classes will stop politely after the benchmark designer says so.
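The gains in the table are easier to compare as differences. The snippet below only restates the table's own numbers and subtracts them; nothing in it is new data.

```python
# (setting, method) -> (baseline accuracy, accuracy with M&B), copied from the table.
results = {
    ("CIFAR-100, 50 tasks", "PODNet"): (58.37, 63.29),
    ("CIFAR-100, 50 tasks", "AFC"): (61.94, 65.38),
    ("CIFAR-100, 50 tasks", "FOSTER"): (59.60, 62.47),
    ("ImageNet-1000, 10 tasks", "PODNet"): (65.58, 67.76),
    ("ImageNet-1000, 10 tasks", "AFC"): (66.39, 69.51),
}

gains = {key: round(with_mb - baseline, 2) for key, (baseline, with_mb) in results.items()}

for (setting, method), gain in gains.items():
    print(f"{setting:<24} {method:<7} +{gain:.2f} points")
```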

The FOSTER result also deserves a careful reading. FOSTER already uses enlarged model capacity and ensemble-like effects, so M&B’s gains are smaller when the number of tasks is low. But at 50 tasks on CIFAR-100, FOSTER’s baseline falls to 59.60, while FOSTER + M&B reaches 62.47. The paper interprets this as M&B becoming more helpful when capacity-based advantages weaken under many increments.

This is a useful business clue. Weight-space discipline may be most attractive not when the model update problem is easy, but when the system faces repeated, fragmented updates over time.

The ablations explain the trade-off better than the headline tables

The most informative table in the paper is not necessarily the main benchmark table. It is the component analysis on CIFAR-100 with 50 incremental stages. The authors remove different pieces of M&B and track forgetting, average new-class accuracy, and overall accuracy.

Variant	Forgetting ↓	Avg. new accuracy ↑	Overall accuracy ↑	What the test is really showing
Full M&B	15.38	59.35	65.38	Balanced stability and plasticity
Without inter-task merging	21.77	62.74	61.81	New learning improves, but old knowledge erodes
Without intra-task merging	13.10	51.60	65.08	Forgetting is low, but adaptation suffers
Without bounded updates	18.72	64.14	64.21	Plasticity rises, but stability weakens

This is the paper’s strongest mechanism evidence. Each component pulls the system differently.

Removing inter-task merging increases forgetting sharply. That supports the claim that inter-task merging is the old-knowledge consolidation mechanism.

Removing intra-task merging produces the lowest forgetting, but the new-class accuracy collapses to 51.60. That is not a victory. A model that preserves the past by refusing to learn the present is not a production asset; it is a museum exhibit with an API endpoint.

Removing bounded updates raises forgetting from 15.38 to 18.72 and lifts new-class accuracy from 59.35 to 64.14, while overall accuracy slips from 65.38 to 64.21. That supports the idea that the bound constrains plasticity in exchange for stability.

Together, the ablation table shows that M&B is not a bag of averaging tricks. It is a negotiated balance. Inter-task merging and bounded updates protect the past. Intra-task merging gives the current task enough room to be learned. The full method wins because it coordinates these pressures, not because one magic component dominates.

The limited-memory tests are closer to enterprise reality than the clean benchmark setting

Many CIL papers use exemplar memory, where a small number of examples from previous classes are stored and replayed. That is useful experimentally, but business data retention is rarely that neat. Privacy rules, storage policies, licensing restrictions, and changing data schemas can all make historical examples difficult to keep.

The paper therefore tests limited-memory settings: one exemplar per class for PODNet and AFC, and no exemplar memory for IL2A. The results are large enough to matter:

Limited-memory CIFAR-100 setting	Baseline at 50 tasks	With M&B at 50 tasks
PODNet, 1 exemplar/class	14.78	20.84
AFC, 1 exemplar/class	23.59	35.25
IL2A, no memory	20.42	43.54

The IL2A result is especially striking: 20.42 to 43.54 at 50 tasks. The correct interpretation is not “M&B solves all low-memory continual learning.” The safer reading is narrower and more useful: when historical examples are scarce, controlling the model’s weight trajectory becomes more valuable, because there is less data available to pull the model back toward older classes.

This is where the business relevance becomes concrete. In product recognition, defect inspection, compliance classification, or document routing, organizations often cannot freely store all past data. A method that reduces dependence on exemplar memory can lower operational friction. But the boundary is equally concrete: the evidence is from supervised vision benchmarks, not from arbitrary enterprise multimodal systems, not from production LLMs, and not from real-time online learning under uncurated drift.

The diagnostic figures show alignment, not just accuracy

The paper includes two diagnostic analyses: cosine similarity among task update vectors and representational similarity using Centered Kernel Alignment (CKA). These should be read as mechanism diagnostics, not as a second thesis.

The cosine-similarity heatmaps compare task update directions. Without M&B, updates are largely independent and sometimes negatively correlated. With M&B, update vectors become positively correlated. That supports the idea that M&B makes sequential task learning less self-conflicting.
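A team that wants the same diagnostic on its own checkpoints could start from something like the sketch below. It assumes a task update vector is the flattened weights after a stage minus the weights before it; the paper may define the vector differently (for example, relative to the base model), and the checkpoint values are invented.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def task_updates(checkpoints):
    """Update vector per stage: weights after the stage minus weights before it."""
    return [[a - b for a, b in zip(after, before)]
            for before, after in zip(checkpoints, checkpoints[1:])]

def similarity_matrix(updates):
    return [[cosine(u, v) for v in updates] for u in updates]

# Hypothetical flattened weights: initial model, then after stages 1, 2, and 3.
checkpoints = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 2.0]]

sim = similarity_matrix(task_updates(checkpoints))
for row in sim:
    print([round(value, 3) for value in row])
```

Off-diagonal entries near zero mean the stages are pulling in unrelated directions, negative entries mean they are actively undoing each other, and positive entries are the pattern the paper reports with M&B.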

The CKA analysis compares representations across models trained at different stages. With M&B, representation similarity across stages increases. This supports the claim that the method reduces representational disruption, which is one way catastrophic forgetting appears inside the network.
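For reference, linear CKA between two sets of activations on the same inputs can be computed without any special library. The sketch uses the linear-kernel form, $\|Y^\top X\|_F^2 / (\|X^\top X\|_F \, \|Y^\top Y\|_F)$ on column-centred features; the article does not say which CKA variant the authors used, and the activation matrices are toy values.

```python
import math

def center_columns(X):
    """Subtract each feature's mean over the examples (rows)."""
    n = len(X)
    means = [sum(column) / n for column in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def cross_sq_norm(X, Y):
    """Squared Frobenius norm of X^T Y for row-aligned X and Y."""
    return sum(sum(a * b for a, b in zip(xc, yc)) ** 2
               for xc in zip(*X) for yc in zip(*Y))

def linear_cka(X, Y):
    X, Y = center_columns(X), center_columns(Y)
    return cross_sq_norm(X, Y) / math.sqrt(cross_sq_norm(X, X) * cross_sq_norm(Y, Y))

# Toy activations: 4 examples, 2 features, from two hypothetical model stages.
stage_a = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
stage_b = [[2 * b, -2 * a] for a, b in stage_a]   # rotated and rescaled copy

print(linear_cka(stage_a, stage_a), linear_cka(stage_a, stage_b))
```

Both values come out at 1 up to floating-point error, because linear CKA ignores rotations and uniform rescaling of the feature space. That invariance is why it is a reasonable way to ask whether two stages represent the data similarly even when individual weights differ.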

These figures do not prove that M&B will generalize to every architecture or deployment context. They do something more specific: they connect the accuracy gains to the proposed mechanism. The model is not merely scoring higher by accident. Its task updates are more aligned, and its representations are more stable across stages.

For a practitioner, that distinction matters. Benchmark gains are useful; diagnostic alignment is more actionable. It suggests what to monitor if the idea is adapted: update-vector similarity, representation drift, forgetting metrics, and class-wise accuracy decay over time.
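Of those signals, forgetting is the one most often computed inconsistently. The sketch below uses a common definition: for each earlier task, the best accuracy it ever had minus its accuracy after the final stage, averaged over earlier tasks. This may not match the paper's exact formula, and the accuracy table is hypothetical.

```python
def average_forgetting(acc):
    """acc[t][j] is the accuracy on task j's classes after training stage t (j <= t)."""
    final = len(acc) - 1
    drops = []
    for j in range(final):
        best_earlier = max(acc[t][j] for t in range(j, final))
        drops.append(best_earlier - acc[final][j])
    return sum(drops) / len(drops)

# Hypothetical per-task accuracies (%) after each of three stages.
acc = [
    [90.0],
    [80.0, 88.0],
    [70.0, 78.0, 86.0],
]

forgetting = average_forgetting(acc)
print(forgetting)   # 15.0
```

Tracking this per task rather than only in aggregate shows which old classes are paying for each update.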

The robustness tests say the details are forgiving, except when they are not

The paper’s variant tests are best read as robustness and implementation guidance.

First, inter-task merging is compared with exponential moving averaging (EMA) variants. The authors find that their simple average performs better overall than EMA with smoothing factors 0.9, 0.5, and 0.1. EMA favors recent models, which increases forgetting. This supports the paper’s conceptual stance: in CIL, recency bias is dangerous because old classes remain part of the evaluation universe.
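The recency bias is visible in the effective weight each past model receives. The sketch assumes the EMA update `base <- alpha * base + (1 - alpha) * theta_k`, seeded with the first model; the paper's convention for the smoothing factor may be the reverse, so read `alpha` here as the weight kept on the old base.

```python
def ema_weights(num_models, alpha):
    """Effective weight on each stage model, oldest first, after an EMA merge."""
    weights = [1.0]                       # the first model seeds the average
    for _ in range(num_models - 1):
        weights = [alpha * w for w in weights] + [1 - alpha]
    return weights

def uniform_weights(num_models):
    """Effective weights under the simple running mean."""
    return [1.0 / num_models] * num_models

print(ema_weights(5, alpha=0.5))   # [0.0625, 0.0625, 0.125, 0.25, 0.5]
print(uniform_weights(5))          # [0.2, 0.2, 0.2, 0.2, 0.2]
```

Under EMA, every model after the first loses a constant fraction of its influence with each later stage. Under the simple average, a stage trained forty updates ago still counts exactly as much as the latest one.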

Second, intra-task averaging periods are varied. Overall accuracy stays close across periods: 65.38, 64.85, 65.19, and 65.51 for different averaging intervals. That suggests the method is not hypersensitive to the exact checkpoint averaging frequency.

Third, BatchNorm handling is not forgiving. Resetting running statistics gives severe degradation, while leaving statistics unchanged also hurts. This is the implementation trap. If a team copies “average checkpoints” but mishandles normalization statistics, it may blame the method while actually breaking the plumbing.

Fourth, bounded-update frequency and threshold are varied. Accuracy remains reasonably stable across tested settings, while the threshold controls the stability-plasticity trade-off. Larger thresholds allow more adaptation but can weaken the old-task anchor. In deployment terms, $B$ is not merely a hyperparameter; it is a governance knob for how much model drift the organization is willing to tolerate per update cycle.

The operational value is training-loop discipline, not inference-time magic

M&B’s business appeal comes from where it sits in the machine learning pipeline. It is a training-time method. The authors report that inter-task and intra-task weight merging take 0.003 seconds each, and bounded update takes 0.011 seconds on a single NVIDIA RTX-8000 with a ResNet-32 backbone. The method requires an additional forward pass for BatchNorm statistics per task, but it adds no inference-time overhead.

That last point is important. Many enterprise AI improvements secretly move cost into serving: larger models, ensembles, extra retrieval calls, routing layers, or cascading classifiers. M&B does not do that. It changes how the model is updated, not how it is served.

A practical adoption pathway would look like this:

Paper result	Cognaptus business interpretation	Boundary
Plug-in gains on PODNet, AFC, FOSTER, and IL2A	The method may improve existing CIL pipelines without redesigning the model architecture	Requires training-loop access and compatibility with the model’s normalization/statistics handling
Gains increase in harder multi-task settings	Useful for systems receiving many class updates over time	Benchmarks split known datasets into tasks; production drift may be messier
Strong low-memory results	Valuable when historical examples are scarce or hard to retain	Does not eliminate the need for validation on old classes
No inference overhead	Attractive for latency-sensitive deployment	Training complexity and monitoring still increase
Update alignment and CKA diagnostics	Provides measurable signals for model-drift governance	Diagnostics must be adapted to each architecture and data regime

This is not a license to fine-tune production models blindly. It is a reason to treat weight movement as an observable operational object. In plain language: do not only ask whether the updated model performs well today. Ask how far it moved, in what direction, and what it disturbed on the way.

Where this applies, and where it does not yet travel safely

The strongest evidence in the paper is for supervised class-incremental computer vision benchmarks. CIFAR-100 and ImageNet-100/1000 are standard, but they are still controlled environments. The tasks are cleanly defined. The label spaces are disjoint. Evaluation is structured around known class increments.

That is not the same as a live enterprise system where labels are noisy, class definitions evolve, and the data distribution shifts for reasons nobody wrote into the benchmark protocol. It is also not direct evidence for LLM continual learning, multi-agent memory, tabular fraud detection, or multimodal enterprise automation. Those may be interesting future directions, but they are inferences, not paper results.

There is also a dependency on implementation discipline. The BatchNorm finding shows that weight-space methods can be fragile when the surrounding training details are mishandled. Teams adopting this idea would need old-class validation sets, forgetting metrics, update-distance monitoring, and probably rollback rules. Otherwise “bounded update” becomes a nice phrase printed on an uncontrolled process. Very corporate, very familiar.

Still, the paper gives a useful operational principle: in continual learning, the path through weight space is part of the product. It should be managed as deliberately as data selection, evaluation, and deployment.

Weight-space surgery is underrated because it looks too simple

M&B is not flashy. It does not propose a grand new architecture. It does not promise a universal memory system. It does not claim that a model can continuously learn everything forever while remaining perfectly stable, which is refreshing because physics has already suffered enough from AI metaphors.

Its contribution is narrower and more credible: for class-incremental vision learning, directly manipulating weights through inter-task averaging, intra-task averaging, and bounded updates can improve stability-plasticity balance across several existing methods. The main evidence shows consistent benchmark gains. The ablations explain why the components work together. The low-memory tests suggest practical relevance. The diagnostic figures connect performance to update alignment and representation stability.

For businesses, the lesson is not “deploy M&B tomorrow.” The lesson is to stop treating model updates as isolated fine-tuning events. In systems that must learn sequentially, model maintenance needs a geometry: where the model starts, how far it moves, how updates align, and whether the resulting representation still serves yesterday’s classes.

That is why weight-space surgery may be underrated. It is not glamorous. It is not loud. It is just the kind of careful operational intervention that keeps a learning system from becoming very confidently forgetful.

Cognaptus: Automate the Present, Incubate the Future.

¹ Taehoon Kim, Donghwan Jang, and Bohyung Han, “Merge and Bound: Direct Manipulations on Weights for Class Incremental Learning,” arXiv:2511.21490, 2025.
