Transformer Architecture

Cost has a way of making architecture less romantic. In diagrams, a Transformer block looks clean: attention mixes tokens, the MLP transforms features, residual connections keep information flowing. In deployment, the same diagram becomes an invoice. Attention is especially expensive because its cost grows with sequence length. In the paper’s LLaMA-7B timing example, an attention layer has roughly half the parameters of an MLP layer, yet runs nearly twice as long at sequence length around 3,000 and about three times as long around 7,000. Attention is elegant. It is also very good at charging rent. ...