MoA Than One Curve: Teaching FFNs to Choose Their Nonlinearity
Model architecture has a recurring habit: when something works, we freeze it into a default and move the argument elsewhere. Attention gets the drama. Routing gets the diagrams. Context windows get the product demos. Meanwhile, the feedforward network sits there, quietly holding a large share of the parameters and applying the same nonlinearity to every token, every time, as if “one curve fits all” were a law of nature rather than a convenient engineering choice. ...