Fine-tuning is supposed to be the polite part of AI customization.
A company uploads domain data. A provider adapts an aligned model. The final model still refuses harmful requests, still answers useful questions, and ideally becomes more competent at the client’s narrow task. Everyone nods. The demo works. The governance slide says “safety preserved.” The slide, as usual, is doing a lot of unpaid labor.
The paper behind this article, Surgery: Mitigating Harmful Fine-Tuning for Large Language Models via Attention Sink, studies a less comfortable possibility: during fine-tuning, a model may not simply “drift” away from safety. It may learn harmful response patterns through identifiable internal attention behavior.[1] More specifically, the authors argue that attention heads associated with harmful learning can be separated using a statistic they call sink divergence, then suppressed during training through a regularizer called Surgery.
That is the interesting part. Not another “our method beats six baselines” table, although the paper has those too. The real contribution is a mechanism chain:
- Attention sinks are not just compression oddities.
- Their behavior differs between harmful-response data and refusal-response data.
- That difference separates attention heads into safety-relevant groups.
- A fine-tuning defense can regularize the harmful side of that separation.
In plain business language: the paper is not saying “filter bad data harder.” It is saying, “watch where the model stores and routes harmful adaptation, then discourage that routing while fine-tuning still happens.”
That is a different control surface. And for anyone selling fine-tuning-as-a-service, it is a useful one.
Harmful fine-tuning is not just bad data entering the pipeline
The familiar safety story treats harmful fine-tuning as a data problem. A user uploads malicious samples; the system fails to detect them; the model becomes less safe. The natural response is then data filtering, user screening, policy enforcement, or post-training repair.
Those responses are necessary. They are also incomplete.
The paper’s threat model assumes a fine-tuning-as-a-service setting where a provider fine-tunes an aligned model on a user-uploaded dataset. The dataset may contain both benign downstream samples and harmful samples. The provider remains responsible for the deployed model because the personalized model is hosted as an API-accessible service. This is not an academic edge case. It is almost exactly the risk shape of commercial customization: user data enters, model behavior changes, provider liability remains.
Existing fine-tuning-stage defenses generally fall into two buckets. One bucket tries to keep the fine-tuned model close to the original aligned model. The other tries to identify harmful samples or tokens and stop the model from learning them. Both approaches are sensible. Both treat the harmful learning process largely from outside the model: constrain the update, filter the input, preserve the distribution.
Surgery looks somewhere else: inside the model, at the attention sink.
An attention sink is a token position that receives disproportionately high attention from other tokens. Earlier work often discussed attention sinks in relation to long-context behavior, compression, or attention allocation. This paper asks a narrower safety question: when a model processes harmful prompt–harmful answer pairs versus harmful prompt–safe refusal pairs, do attention sinks behave differently in ways that predict harmful fine-tuning?
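To make the definition concrete, here is a minimal sketch that finds the sink position in a single head's attention map. It assumes the map is a plain list of rows (one per query token, each summing to 1). The function names, and the choice of "mean attention received by the most-attended position" as the sink value, are illustrative assumptions rather than the paper's exact definition.

```python
def attention_received(attn):
    """Mean attention each key position receives, averaged over query rows."""
    n_queries = len(attn)
    n_keys = len(attn[0])
    return [sum(row[k] for row in attn) / n_queries for k in range(n_keys)]


def sink_position_and_value(attn):
    """Return (index, mean attention received) of the most-attended key position."""
    received = attention_received(attn)
    pos = max(range(len(received)), key=received.__getitem__)
    return pos, received[pos]


# Toy causal attention map for one head over 4 tokens (each row sums to 1).
toy_attn = [
    [1.00, 0.00, 0.00, 0.00],
    [0.90, 0.10, 0.00, 0.00],
    [0.80, 0.10, 0.10, 0.00],
    [0.70, 0.10, 0.10, 0.10],
]
pos, value = sink_position_and_value(toy_attn)
print(pos, round(value, 2))
```

In this toy map the first token soaks up most of the attention, which is the typical sink pattern. A real pipeline would read per-head attention weights from the model instead of a hand-written matrix.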
The answer the paper reports is yes.
Sink divergence turns attention behavior into a safety signal
The paper defines a sink value for each attention head: roughly, how strongly that head concentrates attention on its sink token. It then computes this value on two types of data:
- harmful data: harmful prompts paired with harmful responses;
- refusal data: harmful prompts paired with safe responses.
The difference becomes sink divergence:
$$D_h = S_h^{harmful} - S_h^{refusal}$$
Here, $D_h$ is the sink divergence of attention head $h$, while $S_h^{harmful}$ and $S_h^{refusal}$ are that head’s sink values on harmful and refusal data.
The sign matters.
If $D_h > 0$, the head has higher sink value on harmful data than on refusal data. If $D_h \leq 0$, its sink value on refusal data is at least as high. The paper reports that sink divergence separates attention heads into two groups by sign. The authors then connect those groups to model safety through three observations.
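The sign split can be sketched in a few lines. This assumes per-head sink values have already been measured on harmful and refusal data. The dictionaries, head keys, and numbers below are hypothetical, and sink divergence is taken as the harmful-data sink value minus the refusal-data sink value, as defined above.

```python
def sink_divergence(s_harmful, s_refusal):
    """Per-head divergence: sink value on harmful data minus sink value on refusal data."""
    return {h: s_harmful[h] - s_refusal[h] for h in s_harmful}


def split_by_sign(divergence):
    """Partition heads into the positive group (D > 0) and the rest (D <= 0)."""
    positive = sorted(h for h, d in divergence.items() if d > 0)
    non_positive = sorted(h for h, d in divergence.items() if d <= 0)
    return positive, non_positive


# Hypothetical sink values for four heads, keyed by (layer, head).
s_harm = {(0, 0): 0.62, (0, 1): 0.40, (1, 0): 0.55, (1, 1): 0.30}
s_ref = {(0, 0): 0.50, (0, 1): 0.45, (1, 0): 0.55, (1, 1): 0.10}

d = sink_divergence(s_harm, s_ref)
positive_heads, other_heads = split_by_sign(d)
print("positive:", positive_heads)
print("non-positive:", other_heads)
```

Counting the size of the positive group over time is what produces figures like the 553-to-580 shift discussed next.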
First, as the harmful ratio in the fine-tuning dataset increases, the model’s harmful score increases. That is unsurprising, but it anchors the experiment: more harmful training content makes the model less safe.
Second, as the harmful ratio increases, more attention heads shift toward the positive sink divergence group. For one baseline, Lisa, the number of positive-divergence heads rises from 553 to 580 as the harmful ratio moves from 0 to 0.5. The exact count is less important than the direction: more harmful adaptation is associated with more heads showing the harmful-side sink pattern.
Third, disabling positive-divergence heads suppresses harmful behavior, while disabling negative-divergence heads worsens harmfulness. This is the mechanism test that makes the paper more than a correlation story. The authors are not only observing that unsafe models have different attention statistics. They are intervening on the identified heads and watching safety behavior change.
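One common way to "disable" a head is zero-ablation: replace its output with zeros before the heads are combined. The sketch below assumes that style of intervention. The paper's actual ablation procedure may differ, and the head outputs and divergence values are hypothetical.

```python
def combine_heads(head_outputs, disabled=frozenset()):
    """Concatenate per-head output vectors, zeroing any head listed in `disabled`."""
    combined = []
    for h, vec in enumerate(head_outputs):
        if h in disabled:
            combined.extend([0.0] * len(vec))  # ablated head contributes nothing
        else:
            combined.extend(vec)
    return combined


# Hypothetical outputs of three heads (2 dims each) and their sink divergences.
head_outputs = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.7]]
divergence = [0.12, -0.05, 0.20]

# Ablate only the positive-divergence heads, as in the intervention described above.
positive = {h for h, d in enumerate(divergence) if d > 0}
ablated = combine_heads(head_outputs, disabled=positive)
print(ablated)
```

Running the same evaluation with the positive group ablated, then with the non-positive group ablated, is the kind of paired comparison that turns a correlation into an intervention test.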
This leads to the paper’s central hypothesis: attention heads associated with learning harmful patterns during fine-tuning are separable by the sign of sink divergence.
Not perfectly proven in the philosophical sense. Mechanistic interpretability rarely gets that luxury. But it is operationally useful: the sign of a measurable attention statistic becomes a candidate safety control.
Surgery is a regularizer, not a separate safety lecture
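The proposed method, Surgery, adds a sink divergence suppression term during fine-tuning. The basic training objective keeps the ordinary cross-entropy loss on the user task, then adds a penalty for positive sink divergence. Schematically, with $\lambda$ as the regularizer intensity (the paper's exact notation and normalization may differ):
$$\mathcal{L} = \mathcal{L}_{CE} + \lambda \sum_h \mathrm{ReLU}(D_h)$$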
The ReLU term matters. Surgery does not punish all sink divergence equally. It punishes attention heads whose sink values are higher on harmful samples than on refusal samples. In effect, it pushes heads away from the positive-divergence group and toward the negative-divergence group.
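The one-sided penalty is easy to see in code. This is a sketch under stated assumptions: sink values are plain floats here, whereas a real training loop would use differentiable tensors. Summing over heads and the weight name `lam` are illustrative choices, not confirmed details of the paper.

```python
def relu(x):
    """Pass positive values through; clamp everything else to zero."""
    return x if x > 0.0 else 0.0


def sink_penalty(s_harmful, s_refusal):
    """Sum over heads of ReLU(harmful sink value - refusal sink value)."""
    return sum(relu(sh - sr) for sh, sr in zip(s_harmful, s_refusal))


def regularized_loss(task_loss, s_harmful, s_refusal, lam):
    """Task cross-entropy plus a weighted penalty on positive sink divergence only."""
    return task_loss + lam * sink_penalty(s_harmful, s_refusal)


# Hypothetical values: only the first head has positive divergence (0.6 > 0.5).
s_harm = [0.60, 0.40, 0.50]
s_ref = [0.50, 0.45, 0.50]
total = regularized_loss(task_loss=1.5, s_harmful=s_harm, s_refusal=s_ref, lam=2.0)
print(round(total, 3))
```

The second and third heads contribute nothing to the penalty, so the gradient pressure lands only on heads that currently sit on the harmful side.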
The training loop therefore uses three sources:
| Data source | Role in Surgery | Business translation |
|---|---|---|
| User fine-tuning data | Preserve downstream task learning | The model must still do the client’s job |
| Simulated harmful data | Measure harmful-side sink values | The provider needs known unsafe patterns for calibration |
| Refusal data | Measure safe-response sink values | The provider needs a reference for aligned refusal behavior |
This is an important operational detail. Surgery assumes the provider maintains both harmful and refusal datasets. That is realistic for serious model providers, but not free. It requires safety data infrastructure, not just a new training flag.
It also means Surgery is not “AI safety without safety data.” The method still depends on curated safety examples. The novelty is where the safety signal is applied: not only at the output text or sample label level, but at the attention-head sink behavior level.
The paper’s name is a little dramatic. “Surgery” sounds like someone opened the transformer with sterile instruments and confidence. In practice, the method is a regularized training objective. Less cinematic. More useful.
The main experiments support the mechanism, not just the leaderboard
The paper evaluates Surgery against several fine-tuning-stage baselines: standard supervised fine-tuning (SFT), Lisa, SafeGrad, ConstrainedSFT, AsFT, SPARD, and DSS. The main metrics are:
- Harmful Score (HS): the proportion of unsafe responses to unseen malicious instructions; lower is better.
- Fine-tuning Accuracy (FA): downstream task accuracy on benign test data; higher is better.
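Both metrics are simple proportions. The sketch below assumes a separate moderation classifier has already flagged each response as unsafe or not; that classifier is outside this example. Percent scaling is inferred from the magnitude of the reported scores, and the function names and sample data are hypothetical.

```python
def harmful_score(unsafe_flags):
    """Percentage of responses flagged unsafe (lower is better)."""
    return 100.0 * sum(unsafe_flags) / len(unsafe_flags)


def finetune_accuracy(predictions, labels):
    """Percentage of benign test items answered correctly (higher is better)."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Hypothetical evaluation: 2 of 10 responses to malicious prompts flagged unsafe.
flags = [True, False, False, False, True, False, False, False, False, False]
hs = harmful_score(flags)

# Hypothetical downstream task: 3 of 4 sentiment labels predicted correctly.
fa = finetune_accuracy(["pos", "neg", "pos", "neg"], ["pos", "neg", "neg", "neg"])
print(hs, fa)
```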
The default experiment uses mixed fine-tuning datasets with harmful samples from RepNoise-Refusal, enriched from BeaverTails, and benign downstream data from tasks such as GSM8K, SST2, and AGNEWS. The models include Llama3-8B-Instruct, Gemma2-9B-Instruct, and Qwen2.5-14B-Instruct.
The headline result is straightforward: Surgery lowers harmful scores while preserving task accuracy reasonably well.
Under different harmful ratios on Llama3-8B with GSM8K, Surgery reports an average harmful score of 8.42, compared with 23.10 for plain SFT, 13.88 for Lisa, 16.30 for SafeGrad, 17.80 for ConstrainedSFT, 15.96 for AsFT, 19.84 for SPARD, and 16.86 for DSS. Its average fine-tuning accuracy is 68.70, close to Lisa’s 69.08 and ConstrainedSFT’s 69.14.
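A quick way to read those averages is as gaps against Surgery. The snippet below only rearranges the numbers quoted above; the dictionary and helper names are illustrative.

```python
# Average harmful scores quoted above (Llama3-8B, GSM8K, across harmful ratios).
AVG_HS = {
    "SFT": 23.10, "Lisa": 13.88, "SafeGrad": 16.30, "ConstrainedSFT": 17.80,
    "AsFT": 15.96, "SPARD": 19.84, "DSS": 16.86, "Surgery": 8.42,
}


def relative_reduction(baseline, method):
    """Percentage by which `method` is lower than `baseline`."""
    return 100.0 * (baseline - method) / baseline


baselines = {name: hs for name, hs in AVG_HS.items() if name != "Surgery"}
best_baseline = min(baselines, key=baselines.get)

for name, hs in baselines.items():
    gap = hs - AVG_HS["Surgery"]
    rel = relative_reduction(hs, AVG_HS["Surgery"])
    print(f"{name:>14}: {gap:.2f} points higher than Surgery ({rel:.1f}% reduction)")
print("strongest baseline:", best_baseline)
```

Against plain SFT the reduction is roughly 64%; against Lisa, the strongest baseline on this average, it is roughly 39%.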
That trade-off matters. A safety defense that destroys downstream task performance is not a deployment solution. It is a polite refusal to customize. Surgery’s promise is that it reduces harmful learning without turning the fine-tuned model into a decorative compliance brochure.
The cross-model results are also useful, though not equally strong across architectures:
| Model | Best baseline HS | Surgery HS | Surgery FA |
|---|---|---|---|
| Llama3-8B-Instruct | 14.80 | 8.90 | 68.50 |
| Gemma2-9B-Instruct | 9.50 | 8.20 | 77.10 |
| Qwen2.5-14B-Instruct | 17.20 | 11.30 | 89.70 |
The Gemma result is the least dramatic because SafeGrad is already strong there. That is not a weakness; it is useful information. A mechanism can generalize without producing equally theatrical gains everywhere. Business readers should prefer this kind of uneven result to a suspiciously smooth victory lap.
Across downstream tasks, Surgery reports harmful scores of 9.40 on SST2, 7.40 on AGNEWS, and 8.90 on GSM8K. The fine-tuning accuracy picture is mixed but acceptable: 94.50 on SST2, 88.60 on AGNEWS, and 68.50 on GSM8K. AGNEWS accuracy is below plain SFT’s 90.30, so this is not “safety for free.” It is closer to “meaningful safety improvement with manageable task-performance cost.”
That is still a good bargain in many deployment settings. It is not magic. Sadly, the invoice remains.
The paper’s test suite has different jobs
The evidence in the paper is not one homogeneous block. Some experiments support the main defense claim. Others test mechanism, robustness, cost, or implementation sensitivity. Mixing these together creates the usual bad summary: a pile of numbers wearing a lab coat.
A cleaner reading is this:
| Test or analysis | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Harmful-ratio experiments | Main evidence | Surgery remains effective as harmful proportion changes | Real-world attackers will match these ratios |
| Fine-tuning sample-number experiments | Robustness test | Harmful score remains relatively stable as sample count changes | Unlimited scaling of the defense |
| Cross-model experiments | Generalization check | The method works across 8B–14B instruction-tuned models | Frontier-model behavior |
| Cross-task experiments | Robustness across benign objectives | Sink suppression is not tied to one downstream task | No utility trade-off in all domains |
| HarmBench and SorryBench evaluation | Out-of-distribution harmful-query test | Safety gains transfer beyond BeaverTails test prompts | Complete coverage of harmful behavior |
| Head-shift and layer-wise analyses | Mechanism interpretability | Surgery changes sink divergence in the intended direction | Full causal explanation of all safety behavior |
| Runtime and memory evaluation | System feasibility | Surgery adds modest overhead relative to SFT | Production cost under all training stacks |
| Learning-rate and regularizer sensitivity | Implementation detail / sensitivity | Hyperparameters matter; too aggressive or unstable training hurts | Plug-and-play deployment without tuning |
| HarmfulDiSP appendix comparison | Ablation / alternative mechanism | Sink divergence is a better target than a related harmful-representation statistic | Sink divergence is the only possible useful metric |
This is where the paper becomes more valuable for practitioners. The main result says Surgery works in the tested setting. The supporting analyses say why the authors think it works, when it becomes fragile, and what kind of engineering discipline would be needed to use it.
For example, the paper reports that after Surgery training, more than 96% of attention heads shift toward the negative sink divergence group. That is not a benchmark score; it is a mechanism sanity check. The method claims to suppress positive sink divergence. The observed head shift suggests it is actually touching the intended internal behavior.
Likewise, the system evaluation reports that Surgery takes 0.24 hours and 102.46 GB of GPU memory in their setup, compared with 0.19 hours and 97.37 GB for standard SFT. Lisa, by contrast, takes 0.42 hours and 148.47 GB. The point is not that every production run will cost those exact amounts. The point is that Surgery’s mechanism-level control does not obviously require an absurd compute premium in the tested environment.
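Expressed as ratios against standard SFT, the overhead is easier to compare. The snippet uses only the figures quoted above; the names are illustrative.

```python
# (training hours, GPU memory in GB) as quoted above for the authors' setup.
COST = {
    "SFT": (0.19, 97.37),
    "Surgery": (0.24, 102.46),
    "Lisa": (0.42, 148.47),
}


def overhead_vs_sft(method):
    """Return (time ratio, memory ratio) of `method` relative to standard SFT."""
    hours, mem = COST[method]
    base_hours, base_mem = COST["SFT"]
    return hours / base_hours, mem / base_mem


for method in ("Surgery", "Lisa"):
    t_ratio, m_ratio = overhead_vs_sft(method)
    print(f"{method}: {t_ratio:.2f}x time, {m_ratio:.2f}x memory")
```

That works out to roughly 26% more time and 5% more memory for Surgery, versus roughly 2.2x the time and 1.5x the memory for Lisa.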
That matters because many safety methods quietly die at the integration stage. They are elegant until someone asks how much GPU memory they require, at which point the room discovers a sudden interest in “future work.”
The business value is training-time diagnosis, not just safer output
For enterprise AI teams, the most useful reading of this paper is not “use Surgery tomorrow.” It is broader: harmful fine-tuning may be diagnosable through internal adaptation signals before the final model is deployed.
That changes the governance conversation.
Most business safety processes focus on input screening and output testing. Input screening asks whether the user’s fine-tuning data contains prohibited content. Output testing asks whether the resulting model responds safely to red-team prompts. Both are necessary. Both miss a middle layer: what the model is learning internally while fine-tuning happens.
Surgery suggests a third layer:
| Governance layer | Typical question | What Surgery adds |
|---|---|---|
| Data intake | Is this dataset allowed? | Harmful/refusal calibration sets remain necessary |
| Training monitoring | Is the model learning unsafe internal patterns? | Sink divergence can become a head-level warning signal |
| Deployment evaluation | Does the model produce unsafe outputs? | Harmful Score tests remain the final gate, not the only gate |
| Cost control | Can safety be preserved without expensive retraining? | Regularization may be cheaper than post-hoc repair |
This does not replace audits, red-teaming, or policy enforcement. It makes them less blind. A provider could imagine monitoring sink-divergence patterns during fine-tuning and stopping or adjusting training when unsafe adaptation becomes visible. Surgery itself is one intervention; the monitoring idea may be the more general business lesson.
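The monitoring idea can be sketched as a simple training-time check. This is not something the paper specifies: the halting rule, the `max_new_positive` threshold, and the per-head values are all hypothetical, and a provider would need to calibrate them on its own runs.

```python
def count_positive(divergence):
    """Number of heads whose sink divergence is strictly positive."""
    return sum(1 for d in divergence if d > 0)


def should_halt(baseline_divergence, current_divergence, max_new_positive):
    """Flag a run when positive-divergence heads grow beyond an allowed margin."""
    grown = count_positive(current_divergence) - count_positive(baseline_divergence)
    return grown > max_new_positive


# Hypothetical per-head divergences before fine-tuning and at two checkpoints.
before = [0.10, -0.20, -0.05, -0.30, 0.02, -0.10]
checkpoint_a = [0.11, -0.18, 0.01, -0.28, 0.03, -0.09]   # one head flipped positive
checkpoint_b = [0.15, 0.04, 0.06, -0.10, 0.08, 0.02]     # three heads flipped positive

print(should_halt(before, checkpoint_a, max_new_positive=1))
print(should_halt(before, checkpoint_b, max_new_positive=1))
```

A check like this would sit alongside loss curves in a training dashboard. It is a warning signal, not a verdict.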
The method also speaks to a specific market problem: safe customization. Many organizations want domain-adapted models but do not want to run their own frontier-scale safety research lab. They need controls that live inside the fine-tuning pipeline, not only after deployment. A sink-divergence regularizer is closer to a training-control primitive than to a policy slogan.
And that is exactly where enterprise AI safety needs to move. Less “we tested the chatbot after the fact.” More “we constrained the adaptation process while the model was still changing.”
“Forgetting on purpose” means refusing to memorize the wrong adaptation
The title of this article says memorization is the bottleneck. The paper itself does not use the phrase “memorization sink”; the earlier placeholder version of this post did, and it was too loose. The paper is about attention sinks under harmful fine-tuning, not generic memorization during pretraining.
Still, the title can be salvaged if we interpret “memorization” carefully.
The practical bottleneck is not that a model remembers too much data in general. The bottleneck is that during customization, the model may internalize harmful response patterns strongly enough that ordinary alignment no longer controls behavior. In that sense, safe fine-tuning requires selective forgetting: the model should learn the downstream task while refusing to consolidate harmful-answer routes.
Surgery operationalizes that idea through attention sink behavior. It does not erase knowledge. It penalizes a pattern where harmful samples command stronger sink behavior than refusal samples. That is a more precise version of “forgetting on purpose”: not amnesia, but disciplined non-acquisition.
There is a useful mental model here. Fine-tuning is not a clean upload of new skills. It is a negotiation among representations: task competence, refusal behavior, harmful response patterns, and whatever shortcuts the model finds. Safety fails when harmful patterns become easy for the model to route through. Surgery tries to make that route less attractive.
The model still learns. It just gets a nudge away from the ugly shortcut. A small mercy, but in AI safety, small mercies sometimes ship.
Boundaries that matter before anyone productizes this
The paper is promising, but the deployment boundary is clear.
First, the evidence comes from simulated harmful fine-tuning using public datasets and open-source instruction-tuned models from 8B to 14B scale. That is valuable, but it does not automatically transfer to proprietary frontier models, multimodal systems, long-horizon agents, or customer-specific data distributions.
Second, the method assumes access to harmful and refusal datasets for sink-divergence computation. Serious providers may have these. Smaller teams may not. Poorly constructed calibration data could weaken the method or produce misleading internal signals.
Third, all reported experiments use full-parameter fine-tuning. Many commercial customization workflows use LoRA, adapter tuning, prompt-based customization, retrieval-augmented generation, or hybrid pipelines. Surgery may inspire controls for those settings, but the paper does not prove them.
Fourth, the hyperparameter results matter. At a very low learning rate, Surgery can achieve very low harmful score but poor fine-tuning accuracy. At a high learning rate, both safety and utility deteriorate. The regularizer intensity also has a trade-off: too weak and it does little; too strong and it may harm task performance. This is not a one-click governance button. It is a training method that needs tuning.
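In practice, that tuning usually means a sweep with a utility floor: discard runs that fall below an acceptable accuracy, then pick the safest of what remains. The sketch below shows that selection rule. The sweep values are placeholders invented for illustration, not results from the paper, and the field names are hypothetical.

```python
def pick_config(results, min_accuracy):
    """Among runs meeting the accuracy floor, return the one with the lowest harmful score."""
    viable = [r for r in results if r["fa"] >= min_accuracy]
    if not viable:
        return None  # nothing is both usable and measured; widen the sweep
    return min(viable, key=lambda r: r["hs"])


# Placeholder sweep for illustration only; these are NOT numbers from the paper.
sweep = [
    {"lr": 1e-6, "lam": 0.1, "hs": 2.0, "fa": 40.0},    # very safe, poor utility
    {"lr": 1e-5, "lam": 0.1, "hs": 9.0, "fa": 68.0},    # balanced
    {"lr": 1e-4, "lam": 0.1, "hs": 30.0, "fa": 55.0},   # unstable: both degrade
]
best = pick_config(sweep, min_accuracy=65.0)
print(best)
```

Picking the lowest harmful score without the accuracy floor would select the first run, which is exactly the "safe but useless" failure mode described above.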
Finally, harmful score is a useful metric but not a complete safety guarantee. The paper evaluates on BeaverTails, HarmBench, and SorryBench-style harmful prompts. That is a reasonable benchmark suite. It is not the full space of malicious user behavior, indirect prompt attacks, tool misuse, agentic planning failure, or domain-specific compliance risk.
These boundaries do not make the paper weak. They make it usable. A result with edges is easier to operationalize than a result pretending to be a universal law.
The takeaway for AI providers is to monitor the adaptation, not only the output
The strongest business implication of Surgery is not that every provider should copy the exact regularizer tomorrow. The stronger implication is that fine-tuning safety needs internal telemetry.
If a provider offers model customization, it should ask more than:
- Did we scan the uploaded dataset?
- Did the final model pass a red-team suite?
- Did the user sign the acceptable-use policy?
Those are entry-level questions. Necessary, but not sufficient.
The better questions are:
- Which internal components changed most during fine-tuning?
- Are safety-relevant attention heads shifting toward harmful-response behavior?
- Does the model preserve refusal behavior while learning the downstream task?
- Can unsafe adaptation be suppressed during training instead of repaired after deployment?
- Does the safety intervention preserve utility enough to be commercially viable?
Surgery gives one technical answer to those questions. It identifies positive sink divergence as a signal of harmful learning and suppresses it with a training-time regularizer. The reported gains across harmful ratios, sample counts, models, tasks, and harmful-query benchmarks suggest the idea deserves attention.
Not worship. Attention.
The larger lesson is that alignment is not a sticker placed on a model before customization. It is a property that can be weakened, redistributed, or routed around during fine-tuning. If the business model is “bring us your data and we will adapt the model,” then the safety model must include what happens inside adaptation itself.
Fine-tuning makes models remember new things. The question is no longer whether they remember. It is what internal pathways they use when they do.
And sometimes the safest model is not the one that remembers more. It is the one trained, with some discipline, not to memorize the wrong route.
Cognaptus: Automate the Present, Incubate the Future.
[1] Guozhi Liu, Weiwei Lin, Tiansheng Huang, Ruichao Mo, Qi Mu, Xiumin Wang, and Li Shen, “Surgery: Mitigating Harmful Fine-Tuning for Large Language Models via Attention Sink,” arXiv:2602.05228, https://arxiv.org/html/2602.05228.