Beam Me Less, Scotty: MoE Models Learn When Not to Call Every Expert
Latency has a way of turning elegant model architecture into an invoice. Mixture-of-Experts models were supposed to soften that invoice. Instead of sending every token through the same dense feed-forward machinery, an MoE layer sends each token to only a few experts. In theory, this gives us scale without paying for all parameters on every token. In practice, many deployed MoE models still behave like a restaurant that insists every guest order the same number of dishes. The experts differ, but the billable count is fixed. ...
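To make the fixed "number of dishes" concrete, here is a minimal sketch of conventional top-k routing for a single token. It is an illustration under stated assumptions, not any particular model's router: the function names `softmax` and `route_top_k`, the logit values, and the choice of four experts with k=2 are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-probability experts for one token.

    Returns a list of (expert_index, weight) pairs, with the weights
    renormalised so they sum to 1 over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# Hypothetical router logits for two tokens over four experts.
confident_token = [8.0, 0.1, 0.0, -0.2]   # one expert clearly dominates
uncertain_token = [1.0, 0.9, 0.8, 0.7]    # the router is nearly indifferent

for logits in (confident_token, uncertain_token):
    # Both tokens are dispatched to exactly k=2 experts, regardless of
    # how lopsided the router's preferences are.
    print(route_top_k(logits, k=2))
```

In this sketch the confident token still runs two experts even though the second one receives a weight well under one percent. That is the fixed billable count in miniature: the router chooses which experts run, but not how many.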