Activation Sparsity

The Experts Are Sparse Inside: Why MoE Cost Cuts Stop at 1.2x

Cost has a way of making architecture fashionable. Mixture-of-Experts models became attractive because they promise a pleasant bargain: keep a large total parameter count, but activate only a small part of the model for each token. In business language, that sounds like capacity without the full compute bill. In engineering language, it means routing each token to a few expert feed-forward networks instead of running every expert all the time. ...
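
To make the routing idea concrete, here is a minimal sketch of top-k expert routing for a single token, in plain Python. Everything in it is an illustrative assumption rather than any particular model's implementation: the sizes (`DIM`, `HIDDEN`, `NUM_EXPERTS`, `TOP_K`), the random weights, the helper names, and the choice to apply softmax only over the selected experts' scores.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a short list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(dim, hidden, rng):
    # Hypothetical expert: a two-layer feed-forward network with random weights.
    w_in = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]
    return w_in, w_out


def run_expert(expert, x):
    # Linear layer, ReLU, linear layer.
    w_in, w_out = expert
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return [sum(w * hi for w, hi in zip(row, h)) for row in w_out]


def moe_forward(x, router, experts, top_k):
    # Score every expert (cheap), but run only the top_k highest-scoring ones.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    ranked = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Assumption: gate weights are a softmax over the selected scores only.
    gates = softmax([logits[i] for i in chosen])
    out = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        y = run_expert(experts[idx], x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen


rng = random.Random(0)
DIM, HIDDEN, NUM_EXPERTS, TOP_K = 8, 32, 8, 2  # illustrative sizes only

router = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [make_expert(DIM, HIDDEN, rng) for _ in range(NUM_EXPERTS)]
token = [rng.gauss(0, 1) for _ in range(DIM)]

output, chosen = moe_forward(token, router, experts, TOP_K)
active_fraction = TOP_K / NUM_EXPERTS
print(f"experts run for this token: {sorted(chosen)}")
print(f"fraction of expert parameters touched: {active_fraction:.2f}")
```

With these assumed sizes, the token passes through 2 of 8 experts, so only a quarter of the expert parameters take part in its forward pass. The router still scores all eight experts, but that is a single small matrix-vector product compared with the feed-forward work that is skipped.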