Sparse Computation

The old business pitch for Mixture-of-Experts was satisfyingly simple: activate fewer parameters, spend less compute, keep more capacity on the shelf. It sounded like cloud cost optimization with a PhD. Useful, but not exactly poetic. The newer story is more interesting. Three recent arXiv papers (DOT-MoE, DAG-MoE, and LoopMoE) suggest that MoE is no longer just a sparsity trick. It is becoming an architecture stack for conditional computation: first decide how experts are formed, then how selected experts interact, and finally how sparse expert systems can be reused over iterative depth.[1][2][3] ...
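For readers who want the old pitch in concrete form, here is a minimal sketch of generic top-k routing, the usual way "activate fewer parameters" is realized. It is not taken from DOT-MoE, DAG-MoE, or LoopMoE. The function names `top_k_route` and `active_fraction`, the choice of 8 experts with k = 2, and the assumption that all experts are the same size are illustrative only.

```python
import math
import random


def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Pick the k experts with the largest gate logits.
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_fraction(num_experts, k):
    """Share of expert parameters used per token, assuming equal-sized experts."""
    return k / num_experts


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical gate scores for a single token over 8 experts.
    logits = [random.gauss(0.0, 1.0) for _ in range(8)]
    for expert, weight in top_k_route(logits, k=2):
        print(f"expert {expert}: weight {weight:.3f}")
    print("active share of expert parameters:", active_fraction(8, 2))
```

Only the selected experts run for a given token, which is where the compute saving comes from. The rest stay "on the shelf" as unused capacity until some other token routes to them.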