Measure Twice, Quantize Once
TL;DR for operators Compression is usually sold as a tidy pipeline: pick a smaller architecture, prune some layers, quantize the result, then call procurement and explain why the GPU bill is still rude. This paper argues that the pipeline itself is the problem.1 The authors propose a joint compression framework for Llama-3.1-8B that searches architectural choices and quantization choices together. That means the system does not first decide “how much model” it wants and only afterward decide “how many bits” each part deserves. It treats width, depth, layer importance, weight precision, activation precision, and latency as interacting deployment variables. ...