Sparse Models

TL;DR for operators

Mixture-of-Experts models are supposed to give businesses the best of both worlds: lots of parameters for capability, few active parameters for cost. Lovely on the slide. Messier in the server room.

Two recent papers make the same larger point from opposite sides of the MoE machinery. SoftMoE attacks the compute-allocation problem: why should every token, in every layer, use the same fixed number of experts just because the architecture designer had to choose a value for top-$k$?[1] Tied Expert Layers attacks the memory problem: why should every layer store its own expert FFNs when many of those expert weights may be redundant across nearby layers?[2] ...
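
To make the "lots of parameters, few active parameters" trade concrete, here is a back-of-the-envelope sketch in Python. It is not taken from either paper: the layer count, expert count, top-k value, and model dimensions are made-up illustrative numbers, the names `MoEConfig`, `expert_params`, `stored_expert_params`, `active_expert_params`, and `tie_group` are hypothetical, and each expert is assumed to be a plain two-matrix FFN with biases ignored. Sharing one expert set across a group of consecutive layers is only a simplified stand-in for the idea of reusing expert weights across nearby layers.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    # Hypothetical values, chosen only to keep the arithmetic readable.
    n_layers: int = 24   # MoE layers in the model
    n_experts: int = 8   # experts stored per layer
    top_k: int = 2       # experts each token is routed to, fixed for all tokens
    d_model: int = 1024  # hidden size
    d_ff: int = 4096     # expert FFN inner size


def expert_params(cfg: MoEConfig) -> int:
    """Weights in one expert FFN: up-projection plus down-projection, no biases."""
    return 2 * cfg.d_model * cfg.d_ff


def stored_expert_params(cfg: MoEConfig, tie_group: int = 1) -> int:
    """Expert weights that must sit in memory.

    tie_group is an assumption of this sketch: the number of consecutive
    layers that share one set of experts. tie_group=1 means no sharing.
    """
    if tie_group < 1:
        raise ValueError("tie_group must be at least 1")
    n_expert_sets = -(-cfg.n_layers // tie_group)  # ceiling division
    return n_expert_sets * cfg.n_experts * expert_params(cfg)


def active_expert_params(cfg: MoEConfig) -> int:
    """Expert weights one token actually touches with a fixed top-k router."""
    return cfg.n_layers * cfg.top_k * expert_params(cfg)


cfg = MoEConfig()
untied = stored_expert_params(cfg)               # every layer stores its own experts
tied = stored_expert_params(cfg, tie_group=4)    # each expert set shared by 4 layers
active = active_expert_params(cfg)               # per-token compute footprint

print(f"stored, untied : {untied:,}")
print(f"stored, tied x4: {tied:,}")
print(f"active per token: {active:,}")
```

Under these assumptions, sharing experts shrinks what has to be stored but leaves the per-token active count untouched, while changing `top_k` does the reverse. That is the sense in which the two papers pull on different levers: one on compute per token, the other on memory.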