Energy Bills for Transformers: CEM Makes Layer Design Less Empirical
Weights are expensive twice. First, they cost money to train. Then they cost money every time a model is served, copied, quantized, tuned, monitored, and occasionally blamed for a cloud bill that no one wants to read twice. This is why every architecture paper with the words “efficient,” “low-rank,” “shared,” or “recursive” immediately attracts attention. Some of that attention is deserved. Some of it is merely the industry’s permanent hunger for a cheaper miracle with a nicer benchmark table. ...