Neural Architecture Search

Measure Twice, Quantize Once

TL;DR for operators: Compression is usually sold as a tidy pipeline: pick a smaller architecture, prune some layers, quantize the result, then call procurement and explain why the GPU bill is still rude. This paper argues that the pipeline itself is the problem.[1] The authors propose a joint compression framework for Llama-3.1-8B that searches architectural choices and quantization choices together. That means the system does not first decide “how much model” it wants and only afterward decide “how many bits” each part deserves. It treats width, depth, layer importance, weight precision, activation precision, and latency as interacting deployment variables. ...

When Three Examples Beat a Thousand GPUs

A GPU bill is usually treated as a hardware problem. Buy faster accelerators, shorten training runs, negotiate a better cloud contract. Less often asked is whether the expensive part of the pipeline began with a badly calibrated prompt. An LLM generating neural-network architectures can create thousands of candidates before training begins. If the prompt provides too little context, the model may repeatedly produce shallow variations of the same familiar design. Add more examples, and it may combine useful ideas across architectural families. Add still more, and the output can become worse, incomplete, or invalid. ...

Rollout Renaissance: How Pareto-NRPA Revives Monte Carlo for Multi-Objective Optimization

TL;DR for operators: Many business optimisation problems do not ask for “the best answer.” They ask for a menu of acceptable compromises: cheaper but slower, faster but riskier, smaller but slightly less accurate, feasible but less elegant. This paper matters because it adapts an old Monte Carlo workhorse, Nested Rollout Policy Adaptation, to that messy multi-objective setting. ...
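The "menu of acceptable compromises" in the teaser above is what optimisation people call a Pareto front: the set of candidates that no other candidate beats on every objective at once. The sketch below is a brute-force non-dominated filter, not Pareto-NRPA itself. It assumes every objective is minimised, and the (cost, latency) pairs are made-up illustration values, not results from the paper.

```python
from typing import List, Tuple

# One candidate solution; every objective is assumed to be minimised.
Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the candidates that no other candidate dominates (O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (cost, latency) pairs, purely illustrative.
candidates = [(10.0, 5.0), (8.0, 7.0), (12.0, 4.0), (11.0, 6.0), (8.0, 9.0)]
front = pareto_front(candidates)
print(front)  # [(10.0, 5.0), (8.0, 7.0), (12.0, 4.0)]
```

Here (11.0, 6.0) drops out because (10.0, 5.0) is both cheaper and faster, and (8.0, 9.0) drops out because (8.0, 7.0) costs the same with lower latency. The three survivors are the menu: none is best on both axes, so choosing among them is a business decision rather than an optimisation one.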