Unsafe at Any Bit: Patching the Safety Gaps in Quantized LLMs

TL;DR for operators

Quantizing an LLM is not a harmless cost-saving step. It changes the model, and the paper analysed here shows that those changes can weaken safety even when familiar utility scores still look respectable. That is the uncomfortable part: the dashboard can say “performance preserved” while the model has become more willing to comply with harmful requests. Very efficient. Very modern. Very easy to miss.

The paper’s main evidence comes from a systematic evaluation of quantized versions of Llama-2-7B-Chat and Gemma-7B-Instruct across mainstream quantization approaches, including AWQ, AQLM, LLM-QAT, and QLoRA.¹ The authors test INT8 and INT4 models in the main assessment, then extend the analysis to 3-bit and 2-bit settings. They also vary the quantization-assisting data: benign utility data, indirectly harmful obedience-inducing examples, and directly harmful examples. Attack Success Rate rises sharply in many compressed models, while MT-Bench and AlpacaEval often remain close enough to baseline to lull a deployment team into a false sense of competence.

The business implication is not “never quantize.” That would be theatrical and useless. The implication is that quantization creates a new model artefact requiring its own safety acceptance test. A model that passed refusal evaluation before compression has not automatically passed after compression. Calibration data should be reviewed as safety-relevant data, not merely performance fuel. INT4 and sub-4-bit deployments deserve stricter checks. Serving-time decoding settings should also be tested, because the paper shows that generation choices can expose safety weakness in quantized models.

The proposed repair, Q-resafe, is useful because it does not treat the quantized model as broken glass requiring a complete re-alignment ceremony. It builds preference pairs by asking the pre-quantization model and the quantized model to respond to the same prompts, treats the pre-quantization response as preferred, then uses Direct Preference Optimization while selectively updating safety-critical weights. The result is a targeted safety patch: more like correcting the damaged load-bearing beams than repainting the whole building and calling it structural engineering.

The boundary matters. The evidence is strong enough to justify a deployment checkpoint, but not broad enough to conclude that every quantized model in every domain fails in the same way. The work focuses mainly on two 7B open-source instruction/chat models, selected quantization methods, benchmark safety attacks, and GPT-based evaluation pipelines. Treat it as a practical warning with a repair strategy, not a cosmic theorem about all low-bit models.

The cheap-model step is now a safety boundary

A familiar deployment story goes like this. The research team fine-tunes or selects a model. The safety team evaluates it. The infrastructure team compresses it so it can run faster, cheaper, or closer to the user. Then the organisation ships the compressed version because, according to the usual utility checks, not much was lost.

The paper “Q-resafe: Assessing Safety Risks and Quantization-aware Safety Patching for Quantized Large Language Models” makes that workflow look a little too tidy.¹ Its core message is simple: quantization can preserve utility while damaging safety. The danger is not that the model becomes obviously stupid. The danger is that it remains useful enough to pass ordinary performance screening while becoming more permissive under harmful prompts.

That distinction is operationally important. Most companies do not deploy compressed models because they enjoy numerical elegance. They do it because memory, latency, hardware cost, and edge deployment matter. Quantization helps by converting higher-precision model representations into lower-precision forms such as INT8, INT4, or even lower bit-widths. The usual success criterion is whether the compressed model still answers normal tasks well. But safety behaviour is not just another average-case capability. It often lives in the uncomfortable corners: refusal, constraint-following, adversarial prompting, and the model’s willingness to stop being “helpful” when helpfulness becomes a liability.

The authors’ contribution is useful because it moves the discussion from suspicion to mechanism. They do not merely say “compressed models can be unsafe.” They vary the compression method, the bit-width, and the data used during quantization. Then they propose a patch that tries to restore safety without throwing away the efficiency gains that motivated quantization in the first place.

The result is a much more useful lesson for operators: the safety boundary does not end when the full-precision model is approved. It moves downstream into the compression pipeline.

Compression edits alignment, even when nobody asked it to

Quantization sounds mechanical. A 16-bit number becomes an 8-bit or 4-bit number. Storage drops. Inference becomes cheaper. Everyone nods. Procurement smiles.

The trouble is that an aligned LLM is not a spreadsheet being saved in a smaller format. Its safety behaviour is encoded in weights, activations, instruction tuning, preference optimisation, and post-training artefacts. Quantization changes numerical representations across that system. Even if the goal is only to preserve task performance, the process can perturb safety-relevant behaviour.

The paper makes this mechanism visible through three channels.

First, post-training quantization can disturb the model without giving it a chance to adapt. AWQ, for example, performs quantization without fine-tuning. That makes it attractive for efficient deployment. It also means the model’s compressed weights must preserve all relevant behaviour through the quantization procedure itself. The paper finds that AWQ can raise safety risk under decoding attacks while leaving utility relatively intact.

Second, quantization methods that use fine-tuning or calibration data inherit the risks of that data. AQLM, LLM-QAT, and QLoRA use quantization-assisting datasets in different ways. The paper tests benign data, directly harmful data, and a more subtle indirectly harmful setting that encourages obedience without necessarily presenting overtly toxic instructions. This is where the result becomes unpleasantly practical. A dataset does not need to look like a cartoon villain to weaken refusal behaviour. It can simply teach the model to comply too eagerly.

Third, lower bit-widths reduce representational room for preserving all behaviours equally. A model can retain enough general instruction-following ability to perform well on utility benchmarks while losing the finer structure that supports safety. In the paper’s ablations, safety generally worsens as bit-width drops from 8-bit to 4-bit and then into 3-bit and 2-bit regimes. There is no mystery here. If you squeeze the model harder, more of its internal behaviour gets negotiated away. Safety is one of the things that can lose the negotiation.

This is the misconception the paper corrects: if utility survives, alignment must have survived too. That belief is convenient. It is also wrong.
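The third channel, shrinking representational room, can be illustrated with a toy round-to-nearest quantizer. This is a minimal sketch, not AWQ, AQLM, LLM-QAT, or QLoRA: the function names, the Gaussian stand-in for weights, and the symmetric uniform grid are all assumptions made for illustration. It only shows that the reconstruction error of the same values grows as the bit budget shrinks.

```python
import random

def quantize_symmetric(values, bits):
    """Round each value to a signed uniform grid (toy round-to-nearest scheme)."""
    levels = 2 ** (bits - 1) - 1           # positive levels available at this bit-width
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]

def mean_abs_error(values, bits):
    """Average absolute difference between original and quantized values."""
    quantized = quantize_symmetric(values, bits)
    return sum(abs(v - q) for v, q in zip(values, quantized)) / len(values)

random.seed(0)
# Hypothetical stand-in for a weight tensor: 10,000 Gaussian samples.
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

errors = {bits: mean_abs_error(weights, bits) for bits in (8, 4, 3, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

The error says nothing about which behaviour is lost. That is the point of the section: a utility benchmark and a refusal benchmark can respond very differently to the same perturbation.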

The paper isolates three knobs: method, data, and bit-width

The experimental design is valuable because it does not treat “quantization” as a single blob. The authors separate the pipeline into knobs a deployment team can actually control.

Experimental knob	Paper setup	Likely purpose	Operational question
Base model	Llama-2-7B-Chat and Gemma-7B-Instruct	Main evidence across two aligned open-source 7B models	Does compression affect safety beyond one model family?
Quantization method	AWQ, AQLM, LLM-QAT, QLoRA	Main comparison between PTQ, QAT, and LoRA-style quantized fine-tuning	Which compression workflow creates more safety exposure?
Quantization-assisting data	Risk-I benign UltraChat data, Risk-II indirectly harmful obedience data, Risk-III directly harmful AdvBench-derived data	Main evidence for data sensitivity	Is calibration data a safety asset or a hidden attack surface?
Bit-width	INT8 and INT4 in the main evaluation; 3-bit and 2-bit in ablation	Sensitivity test	How much extra safety risk comes with more aggressive compression?
Safety metric	Attack Success Rate on harmful prompts, plus HarmBench and harmfulness-score checks in appendix	Main evidence plus robustness check	Does the compressed model become more willing to produce harmful output?
Utility metric	MT-Bench and AlpacaEval	Trade-off measurement	Did the model remain useful enough to hide the safety loss?
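The knobs in the table can be laid out as an explicit evaluation grid. The sketch below enumerates the main-evaluation cells for one base model; the variable names are hypothetical, and the grid is an illustration of the layout rather than the authors' actual experiment runner.

```python
from itertools import product

# Knob values taken from the table above (main evaluation only).
METHODS = ["AWQ", "AQLM", "LLM-QAT", "QLoRA"]
DATA_RISK = ["Risk-I", "Risk-II", "Risk-III"]
MAIN_BITS = [8, 4]

# One cell per (method, data, bit-width) combination for a single base model.
grid = [
    {"method": method, "data": data, "bits": bits}
    for method, data, bits in product(METHODS, DATA_RISK, MAIN_BITS)
]

print(len(grid), "cells per base model")
print(grid[0])
```

A deployment team would normally shrink this grid to the one compression stack it actually uses, then add its own production settings as extra axes.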

The baselines are important. Llama-2-7B-Chat starts with an ASR of 0.3 in the paper’s baseline table (ASR is reported as a percentage throughout), with MT-Bench at 6.65 and AlpacaEval at 71.37. Gemma-7B-Instruct starts with ASR 9.2, MT-Bench 6.25, and AlpacaEval 66.53. These are not toy systems invented for a failure demo. They are aligned instruction/chat models being pushed through common compression workflows.

The authors then ask what happens when these models are quantized. Not “can we find one weird jailbreak?” Not “can we produce a dramatic anecdote?” The paper’s more useful question is whether safety degradation appears systematically across method choice, calibration data, and precision level.

It does.

Utility survives well enough to mislead the dashboard

The most operator-relevant result is not that safety sometimes gets worse. Everyone in AI safety has a warehouse full of ways to make a model worse. The more useful finding is that safety can degrade while ordinary utility remains tolerable.

For Llama-2-7B-Chat under INT4 quantization, the paper reports the following pattern in its main safety assessment:

INT4 method on Llama-2-7B-Chat	Risk-I benign ASR	Risk-II indirect harmful ASR	Risk-III direct harmful ASR	MT-Bench	AlpacaEval
AWQ	42.4	42.4	42.4	6.51	68.37
AQLM	18.5	75.5	77.4	6.40	66.42
LLM-QAT	16.9	82.9	71.2	6.71	66.54
QLoRA	42.3	83.4	85.3	6.40	63.92

The baseline Llama model’s ASR is 0.3 in the paper’s baseline setting. After quantization, the benign-data cases already show material safety degradation. Under indirectly harmful or directly harmful quantization-assisting data, some ASR values climb into the 70–85 range. Meanwhile, MT-Bench remains in roughly the same neighbourhood as the baseline. AlpacaEval weakens, but not so catastrophically that a team focused on utility alone would necessarily block release.

That is the trap. Utility metrics can tell the deployment team, “The model is still useful.” They do not necessarily say, “The model is still safe.”

The Gemma results follow the same broad direction, though not always with the same magnitude. Under INT4, QLoRA reaches ASR 39.4 on benign data, 68.6 on the indirectly harmful setting, and 61.3 on directly harmful data, with MT-Bench still at 6.15. AQLM and LLM-QAT also show safety degradation under higher-risk data. The exact numbers vary by model and method, but the operational pattern holds: compression can preserve visible helpfulness while weakening refusal.

For a business reader, this changes the meaning of a “successful” quantization run. A compression job is not successful merely because the model is smaller and still scores decently on utility. It is successful only if the compressed artefact passes the safety tests appropriate to its deployment context.
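A two-part release gate makes the gap concrete. The sketch below uses the Llama-2-7B-Chat baseline and the INT4 Risk-I rows reported above; the three thresholds and the function names are hypothetical choices for illustration, not values from the paper.

```python
# Baseline and INT4 Risk-I figures quoted in this article.
BASELINE = {"asr": 0.3, "mt_bench": 6.65, "alpaca_eval": 71.37}
INT4_RISK_I = {
    # method: (ASR, MT-Bench, AlpacaEval)
    "AWQ": (42.4, 6.51, 68.37),
    "AQLM": (18.5, 6.40, 66.42),
    "LLM-QAT": (16.9, 6.71, 66.54),
    "QLoRA": (42.3, 6.40, 63.92),
}

# Hypothetical acceptance thresholds; a real team would set its own.
MAX_MT_BENCH_DROP = 0.5
MAX_ALPACA_DROP = 8.0
MAX_ASR_INCREASE = 5.0

def utility_gate(mt_bench, alpaca_eval):
    """Pass if neither utility score fell too far below the baseline."""
    return (BASELINE["mt_bench"] - mt_bench <= MAX_MT_BENCH_DROP
            and BASELINE["alpaca_eval"] - alpaca_eval <= MAX_ALPACA_DROP)

def safety_gate(asr):
    """Pass if ASR did not rise too far above the baseline."""
    return asr - BASELINE["asr"] <= MAX_ASR_INCREASE

report = {
    method: {"utility": utility_gate(mt, alpaca), "safety": safety_gate(asr)}
    for method, (asr, mt, alpaca) in INT4_RISK_I.items()
}
for method, result in report.items():
    print(method, result)
```

Under these assumed thresholds every row clears the utility gate and fails the safety gate, which is exactly the situation a utility-only dashboard would miss.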

Calibration data is a governance object, not just a performance sample

The paper’s most quietly useful move is its treatment of quantization-assisting data. It divides data into three risk levels.

Risk-I uses benign UltraChat samples. Risk-III uses directly harmful instructions and harmful responses derived from AdvBench. Risk-II is subtler: it uses indirectly harmful identity-shifting or obedience-inducing examples, including an “absolutely obedient agent” style prompt in the evaluation setting. The key point is that Risk-II does not need to be full of overtly toxic content to damage safety. It teaches the model the wrong posture.

That distinction matters for enterprise AI governance. Many teams already know they should not fine-tune on obviously harmful content. Fine. Congratulations on not handing the model a flamethrower. The harder problem is data that looks like ordinary instruction-following but reinforces unconditional compliance. In customer support, internal automation, legal drafting, medical triage, finance, education, and developer tooling, “always follow the user’s instruction” is often a safety failure masquerading as a product requirement.

The paper shows that indirectly harmful data can be just as dangerous as directly harmful data in some settings. For Llama-2-7B-Chat with INT4 LLM-QAT, ASR is 82.9 under Risk-II and 71.2 under Risk-III. With INT4 QLoRA, Risk-II reaches 83.4 and Risk-III 85.3. The exact ordering is less important than the practical lesson: obedience-shaped data can be enough to compromise safety.

The benign-data results are also uncomfortable. Even with Risk-I data, safety degradation appears across methods. That suggests the problem is not only contaminated data. It is also the objective mismatch. Quantization workflows are usually designed to preserve utility. They are not automatically designed to preserve refusal boundaries, safety style, or adversarial robustness.

So calibration data should be treated like a governed dataset. It should have provenance, risk labels, review procedures, and post-quantization safety tests. If that sounds bureaucratic, consider the alternative: quietly shipping a cheaper model that became more compliant with harmful prompts because the compression pipeline was treated as plumbing.

Bit-width is a safety policy, not just an infrastructure setting

Bit-width is usually discussed as an efficiency trade-off. INT8 is safer for quality, INT4 is cheaper, sub-4-bit is more aggressive. The paper adds another axis: safety degradation.

In the bit-width ablation on Llama-2-7B-Chat using UltraChat as quantization-assisting data, ASR rises as precision drops:

Method	8-bit ASR	4-bit ASR	3-bit ASR	2-bit ASR
AQLM	17.1	18.5	28.6	40.1
LLM-QAT	15.1	16.9	25.4	36.9
QLoRA	41.7	42.3	67.3	82.0
AWQ	10.5	17.4	29.5	38.6
Q-resafe	1.6	1.8	5.9	12.4

The pattern is not perfectly linear across every method, but the direction is clear enough for decision-making. More aggressive compression tends to create more safety exposure. The steep move from 4-bit to 3-bit is especially relevant for teams chasing edge deployment or extreme serving efficiency. A sub-4-bit model should not be approved by arguing that the 8-bit model passed evaluation. That is the kind of reasoning that sounds efficient until it becomes incident documentation.

This does not mean INT4 or 3-bit models are unusable. It means the bit-width decision belongs in the model risk register. It should be attached to a safety evaluation plan, not buried in an inference-cost spreadsheet.
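The step sizes in the ablation table are worth computing explicitly. The sketch below uses only the ASR values from the table above; the helper name is hypothetical.

```python
# ASR by bit-width, copied from the ablation table above.
ASR_BY_BITS = {
    "AQLM": {8: 17.1, 4: 18.5, 3: 28.6, 2: 40.1},
    "LLM-QAT": {8: 15.1, 4: 16.9, 3: 25.4, 2: 36.9},
    "QLoRA": {8: 41.7, 4: 42.3, 3: 67.3, 2: 82.0},
    "AWQ": {8: 10.5, 4: 17.4, 3: 29.5, 2: 38.6},
    "Q-resafe": {8: 1.6, 4: 1.8, 3: 5.9, 2: 12.4},
}

def asr_increase(method, from_bits, to_bits):
    """ASR change, in percentage points, when moving between two bit-widths."""
    row = ASR_BY_BITS[method]
    return round(row[to_bits] - row[from_bits], 1)

for method in ASR_BY_BITS:
    print(
        f"{method}: 8->4 bit {asr_increase(method, 8, 4):+.1f}, "
        f"4->3 bit {asr_increase(method, 4, 3):+.1f}, "
        f"3->2 bit {asr_increase(method, 3, 2):+.1f}"
    )
```

In every row the 4-bit to 3-bit step is larger than the 8-bit to 4-bit step, which is why an 8-bit approval says little about a sub-4-bit artefact.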

Q-resafe repairs the compressed model without pretending it is full precision

After documenting the failure mode, the paper proposes Q-resafe: a quantization-aware safety patching framework. Its design is interesting because it respects the reason quantization exists. It does not simply say, “Fine, retrain everything.” That would be like solving a leaky pipe by demolishing the building. Thorough, perhaps. Not economical.

Q-resafe is built around three ideas.

First, it uses the pre-quantization model as a safety teacher. For prompts from an auxiliary calibration dataset, the authors generate one response from the pre-quantization model and one from the quantized model. The pre-quantization response is labelled as the preferred response; the quantized response is labelled as dispreferred. This creates preference triplets (prompt, preferred response, dispreferred response) without requiring manual annotation for every example.

Second, it uses Direct Preference Optimization to pull the quantized model back toward the safety behaviour of its pre-quantization version. The point is not to make the quantized model generically more polite. The point is to transfer safety behaviour from the model before compression damaged it.

Third, it selectively updates safety-critical weights. The method uses SNIP-style importance scoring to identify weights that matter most for safety-related behaviour, then applies a mask so the patch touches only the selected portion. These weights are periodically re-identified during training because the relevant subset can shift as the model updates.

A simplified view looks like this:

Prompt set
   ↓
Pre-quantization model response  → preferred response
Quantized model response         → dispreferred response
   ↓
DPO safety-patching objective
   ↓
SNIP-based mask identifies safety-critical weights
   ↓
Selective update of quantized model
   ↓
Safety-patched quantized LLM
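The two numerical pieces of that flow can be sketched on toy values. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the standard DPO loss on sequence log-probabilities and a SNIP-style saliency of |weight × gradient|, and every function name, the choice of reference model, the learning rate, and the keep fraction are hypothetical.

```python
import math

def dpo_loss(policy_pref, policy_disp, ref_pref, ref_disp, beta=0.1):
    """Standard DPO loss for one triplet, given sequence log-probabilities.

    policy_* come from the model being patched, ref_* from a frozen reference
    (which model plays the reference role is an assumption here).
    """
    margin = (policy_pref - ref_pref) - (policy_disp - ref_disp)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def snip_mask(weights, grads, keep_fraction):
    """Mark the top fraction of weights by |weight * grad| as updatable."""
    scores = [abs(w * g) for w, g in zip(weights, grads)]
    k = max(1, int(len(weights) * keep_fraction))
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    selected = set(ranked[:k])
    return [i in selected for i in range(len(weights))]

def masked_step(weights, grads, mask, lr):
    """Gradient step that touches only the masked (safety-critical) weights."""
    return [w - lr * g if m else w for w, g, m in zip(weights, grads, mask)]

# Toy numbers: four weights and the gradient of the DPO loss w.r.t. each.
weights = [0.5, -1.2, 0.05, 2.0]
grads = [0.1, 0.4, 0.9, -0.01]

mask = snip_mask(weights, grads, keep_fraction=0.25)
updated = masked_step(weights, grads, mask, lr=0.1)
print(mask)
print(updated)
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

In a training loop the mask would be recomputed periodically, matching the re-identification step described above, rather than fixed once at the start.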

The mechanism matters because it explains why Q-resafe is not merely another fine-tuning baseline. It is trying to preserve the compression benefit by leaving most quantized weights alone while editing the small subset most relevant to the safety failure. In the AWQ setting without fine-tuning, the authors use a variant that identifies safety-critical weights in the full-precision model and keeps those weights at 16-bit while quantizing the rest. That is a slightly different repair shape, but it follows the same principle: do not compress safety-critical structure blindly.
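The AWQ variant, keeping safety-critical weights at higher precision while quantizing the rest, can be sketched the same way. The importance scores are assumed to be given, the quantizer is a toy round-to-nearest grid rather than AWQ itself, and the function name and keep fraction are hypothetical.

```python
def quantize_except_critical(weights, scores, bits, keep_fraction):
    """Quantize all weights except the highest-scoring fraction, which stay as-is."""
    k = max(1, int(len(weights) * keep_fraction))
    ranked = sorted(range(len(weights)), key=scores.__getitem__, reverse=True)
    critical = set(ranked[:k])

    levels = 2 ** (bits - 1) - 1                      # toy symmetric uniform grid
    scale = max(abs(w) for w in weights) / levels
    return [
        w if i in critical else round(w / scale) * scale
        for i, w in enumerate(weights)
    ]

# Toy values: the first weight has the highest (assumed) safety importance.
weights = [0.31, -0.77, 1.50, 0.02]
scores = [0.9, 0.1, 0.2, 0.05]

mixed = quantize_except_critical(weights, scores, bits=4, keep_fraction=0.25)
print(mixed)
```

The storage cost of the protected subset is the price of the repair, so the keep fraction is a tuning decision rather than a free parameter.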

The patch evidence is strongest when read as targeted repair, not magic alignment

The paper reports that Q-resafe restores safety much closer to the pre-quantization models while preserving utility. The most important experimental sections have different roles, and reading them as one undifferentiated pile of numbers would miss the point.

Test or table	Likely purpose	What it supports	What it does not prove
Main quantization assessment	Main evidence	Quantization can raise ASR across methods, models, datasets, and bit-widths while utility remains relatively preserved	That every model and every quantization method fails equally
Q-resafe safety patching results	Main evidence	Targeted patching can reduce safety degradation under benign, indirect harmful, and direct harmful settings	That Q-resafe guarantees safety against adaptive attacks
AWQ decoding-strategy test	Robustness and implementation-specific stress test	Quantized models without fine-tuning can become vulnerable under generation settings; selective precision can help	That decoding is the only cause of AWQ safety degradation
Safety-critical weight ablation	Ablation	Identifying and updating the right weights is central to Q-resafe’s efficiency and safety recovery	That the exact threshold will generalise unchanged
SFT/DPO/Q-resafe comparison	Ablation and efficiency comparison	Q-resafe can match DPO-like safety recovery with much lower GPU time in tested settings	That SFT or DPO are always poor choices
8/4/3/2-bit study	Sensitivity test	Lower bit-width tends to worsen safety risk; Q-resafe remains stronger across tested precisions	That sub-4-bit safety is solved
HarmBench and harmfulness-score check	Robustness check	The safety degradation is not merely an artefact of one ASR scoring method	That benchmark evaluation captures all real-world harm
LLM.int8, NF4, FP4 appendix test	Exploratory extension	Q-resafe can patch additional popular quantization formats	That it is fully method-agnostic across all compression stacks

The ablation on safety-critical weight identification is particularly revealing. On Llama-2-7B-Chat with 4-bit quantization using a benign UltraChat setting, updating a broad selected set of safety-critical weights produces ASR around 1.6–1.8 with modest GPU time. Reducing the selected portion raises ASR. With no safety-critical identification, ASR rises to 42.2, with MT-Bench at 6.4. The model is still useful. It is also much less safe. There is the whole paper in miniature.

The method comparison is also operationally interesting. For LLM-QAT at 4-bit, SFT gives ASR 12.4 and takes 8.4 GPU hours; DPO gives ASR 1.5 and takes 9.6 GPU hours; Q-resafe gives ASR 1.6 and takes 1.2 GPU hours. For QLoRA, SFT gives ASR 26.9 in 3.4 GPU hours; DPO gives ASR 2.4 in 3.8 GPU hours; Q-resafe also gives ASR 2.4 in 1.2 GPU hours. This is not just a safety result. It is an efficiency result: targeted repair can be cheaper than broad repair while reaching similar safety outcomes in the tested setting.

That matters because model governance often fails when the recommended control is too expensive or too slow to become routine. Q-resafe’s practical appeal is that it turns post-quantization safety repair into something closer to a pipeline step. Not free. Not automatic. But plausible.
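The efficiency claim is simple arithmetic on the figures quoted above. The sketch below computes how many times more GPU hours SFT and DPO used than Q-resafe in each tested setting; the dictionary layout and names are hypothetical.

```python
# (ASR, GPU hours) at 4-bit, as quoted in the method comparison above.
RESULTS = {
    "LLM-QAT": {"SFT": (12.4, 8.4), "DPO": (1.5, 9.6), "Q-resafe": (1.6, 1.2)},
    "QLoRA": {"SFT": (26.9, 3.4), "DPO": (2.4, 3.8), "Q-resafe": (2.4, 1.2)},
}

def cost_ratio(quantizer, baseline):
    """GPU hours of a baseline repair divided by Q-resafe's GPU hours."""
    _, baseline_hours = RESULTS[quantizer][baseline]
    _, patch_hours = RESULTS[quantizer]["Q-resafe"]
    return round(baseline_hours / patch_hours, 2)

for quantizer in RESULTS:
    for baseline in ("SFT", "DPO"):
        print(f"{quantizer}: {baseline} used {cost_ratio(quantizer, baseline)}x "
              f"the GPU hours of Q-resafe")
```

These ratios describe the tested settings only; they are not a general speed-up guarantee.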

Decoding settings are part of the compressed model’s risk surface

The appendix includes a decoding-focused discussion that should not be treated as a side curiosity. For quantization without fine-tuning, the authors evaluate AWQ under decoding attacks by varying generation parameters such as temperature and sampling settings. They report that modified decoding can expose safety vulnerabilities, and Table 4 shows that Q-resafe-style treatment brings ASR much closer to the full-precision baseline under those varied decoding settings.

This has a very practical consequence: a model is not fully specified by its weights alone. The deployed behaviour also depends on system prompts, sampling parameters, routing, wrappers, refusal templates, and application logic. A compressed model tested under one decoding configuration may behave differently when product teams adjust creativity, latency, or output diversity.

The paper’s decoding result should therefore be read as a robustness test. It does not mean every serving configuration is dangerous. It means serving configuration belongs in the evaluation matrix. If a company approves a quantized model at one temperature and deploys it at another, it has changed the product. The model card did not magically update itself. Apparently one still has to do the work.
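Putting serving configuration into the evaluation matrix can look like the sketch below. Everything here is an assumption for illustration: the temperature and top-p values, the function names, and especially `fake_measure_asr`, which is a synthetic stand-in for a real safety evaluation and does not represent any measured result.

```python
from itertools import product

# Hypothetical decoding settings a product team might ship.
TEMPERATURES = [0.0, 0.7, 1.0]
TOP_PS = [0.9, 1.0]

def worst_case_asr(measure_asr, temperatures, top_ps):
    """Evaluate every decoding configuration and return the worst one."""
    results = {
        (t, p): measure_asr(t, p) for t, p in product(temperatures, top_ps)
    }
    worst_config = max(results, key=results.get)
    return worst_config, results[worst_config], results

def is_covered(production_config, results):
    """True only if the exact production configuration was evaluated."""
    return production_config in results

def fake_measure_asr(temperature, top_p):
    """Synthetic placeholder; replace with a real harmful-prompt evaluation."""
    return 10.0 * temperature + top_p

worst_config, worst_value, results = worst_case_asr(
    fake_measure_asr, TEMPERATURES, TOP_PS
)
print("worst configuration:", worst_config)
print("production (0.7, 0.9) covered:", is_covered((0.7, 0.9), results))
print("production (1.3, 0.95) covered:", is_covered((1.3, 0.95), results))
```

Gating on the worst evaluated configuration, and refusing configurations that were never evaluated, is one way to keep the approval tied to what actually ships.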

What Cognaptus would change in the deployment checklist

The paper directly shows a benchmarked safety problem and a proposed technical patch. The business interpretation is broader but should stay disciplined. The right conclusion is not that quantization is reckless. It is that quantization creates a new artefact in the risk pipeline.

Deployment layer	What the paper directly shows	Cognaptus inference for business use	Boundary
Model registry	Quantized versions can have different ASR from their full-precision baselines	Register each quantized model as a distinct governed artefact, not merely a variant file	Evidence is from selected open-source 7B models
Safety acceptance	Utility can remain broadly preserved while safety worsens	Require post-quantization refusal and harmfulness evaluation before release	Benchmarks are not exhaustive real-world safety tests
Calibration data	Benign, indirectly harmful, and directly harmful data produce different risk profiles	Treat quantization-assisting data as safety-relevant and review it accordingly	Risk-II examples are sensitive and not fully released
Bit-width choice	Lower bit-widths generally increase ASR in ablations	Attach stricter safety thresholds to INT4 and sub-4-bit deployments	Hardware and model architecture may change the curve
Repair strategy	Q-resafe restores safety while preserving utility in tested settings	Add targeted post-quantization repair to the compression pipeline where safety risk is material	Requires access to the pre-quantization model or a strong aligned teacher
Serving configuration	Decoding changes can affect safety behaviour	Evaluate compressed models under actual production decoding and system-prompt settings	Does not cover every product wrapper or agentic workflow

For a company deploying LLMs in regulated or reputationally sensitive settings, the immediate policy is straightforward:

1. Evaluate the full-precision model.
2. Quantize the model.
3. Re-evaluate the quantized model on safety, not just utility.
4. Review the quantization-assisting data.
5. Test production decoding settings.
6. Apply targeted patching if safety degradation is material.
7. Record the compressed model as a separate approved artefact.

That is not glamorous. Good governance rarely is. Glamour is what teams reach for when the checklist is missing.
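The seven-step policy can be encoded as a completeness check on the artefact record. This is a sketch of one possible shape: the class name, step identifiers, and example artefact name are all hypothetical, and a real registry would attach evidence to each step rather than a bare flag.

```python
from dataclasses import dataclass, field

# One identifier per step of the policy above, in order.
CHECKLIST = (
    "full_precision_evaluated",
    "quantized",
    "quantized_safety_evaluated",
    "calibration_data_reviewed",
    "production_decoding_tested",
    "patch_applied_or_not_needed",
    "registered_as_separate_artefact",
)

@dataclass
class QuantizedArtefact:
    """A compressed model tracked as its own governed artefact."""
    name: str
    completed: set = field(default_factory=set)

    def complete(self, step):
        if step not in CHECKLIST:
            raise ValueError(f"unknown checklist step: {step}")
        self.completed.add(step)

    def missing_steps(self):
        return [step for step in CHECKLIST if step not in self.completed]

    def approved(self):
        return not self.missing_steps()

artefact = QuantizedArtefact(name="example-7b-int4")  # hypothetical name
for step in CHECKLIST[:2]:
    artefact.complete(step)
print(artefact.approved(), artefact.missing_steps())
```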

The strongest result is operational, not philosophical

There is a tempting philosophical reading of the paper: alignment is fragile, compression perturbs representations, safety is not separable from capability, and so on. All true enough. But the more useful reading is operational.

The paper identifies a specific failure mode in a specific engineering workflow. It then proposes a repair that fits the workflow. That is the shape of research that can change deployment practice.

The mechanism-first view is important because it prevents two lazy interpretations.

The first lazy interpretation is “quantization is unsafe.” That is too broad. Quantization is a family of methods with different objectives, bit-widths, calibration data, and serving assumptions. The paper shows risk variation across methods and settings. It does not justify a blanket ban.

The second lazy interpretation is “Q-resafe solves safety.” Also too broad. Q-resafe reduces measured safety degradation in the tested scenarios. It does not certify models against every jailbreak, every language, every agentic tool-use environment, every domain-specific harm, or every future attack strategy. It is a repair method, not a papal blessing.

The practical middle is better: quantization should be treated as a safety-relevant model transformation. That transformation can be tested. Its risks can be reduced. Its remaining uncertainty can be documented.

The boundaries are narrow enough to matter

The paper is useful, but its limits should shape how it is applied.

First, the main experiments focus on Llama-2-7B-Chat and Gemma-7B-Instruct. These are meaningful models, but they are not the whole deployment universe. Larger models, smaller edge models, multilingual models, domain-specialised models, and proprietary systems may behave differently.

Second, the quantization methods are representative, not exhaustive. AWQ, AQLM, LLM-QAT, and QLoRA cover important territory, and the appendix adds LLM.int8, NF4, and FP4. Still, compression methods evolve quickly. A safety policy should evaluate the actual compression stack being used, not rely on a one-time conclusion from one paper.

Third, the safety measurement relies heavily on ASR under benchmarked harmful prompts, with additional HarmBench and harmfulness-score checks. That is appropriate for the paper’s question, but production harm is messier. Enterprise systems involve retrieval, tools, memory, user roles, policy layers, multilingual inputs, and workflow consequences. A model that behaves safely in a benchmark can still fail in an application context.

Fourth, Q-resafe assumes a usable safety teacher: ideally the pre-quantization model, or another strongly aligned model. That is plausible in many enterprise settings because compression usually happens after model selection. But it is not guaranteed in third-party model supply chains, where teams may receive only the compressed artefact.

Fifth, the paper’s indirectly harmful dataset is intentionally not fully released due to sensitivity. That is understandable, but it limits external inspection. The concept is still valuable: obedience-inducing data can matter. But exact replication of that risk condition is constrained.

None of these limits weaken the central deployment lesson. They define its scope. The paper does not say every quantized model is unsafe in every setting. It says enough quantized models become less safe under plausible workflows that safety evaluation after compression should become normal practice.

The bit-budget now has a safety budget

Quantization is not going away. It is too useful. Companies want cheaper inference, lower latency, private on-device deployment, and smaller serving footprints. Refusing compression would be less a safety strategy than a procurement tantrum.

The smarter move is to stop treating compression as a purely technical optimisation. A compressed model is not just the same model with a smaller invoice. It is a new behavioural object. It needs safety testing, data governance, serving-configuration checks, and sometimes targeted repair.

Q-resafe is valuable because it points toward a practical version of that workflow. Use the safer full-precision model as a teacher. Build preference pairs without expensive manual annotation. Identify the safety-critical weights. Patch those weights instead of broadly disturbing the model. Preserve utility where possible. Then test again, because belief is not an evaluation protocol.

The headline lesson is simple: when the model loses bits, do not assume it kept its boundaries.

Cognaptus: Automate the Present, Incubate the Future.

¹ Kejia Chen, Jiawen Zhang, Jiacong Hu, Yu Wang, Jian Lou, Zunlei Feng, and Mingli Song, “Q-resafe: Assessing Safety Risks and Quantization-aware Safety Patching for Quantized Large Language Models,” arXiv:2506.20251, submitted 25 June 2025, ICML 2025. https://arxiv.org/abs/2506.20251
