Model architecture has a recurring habit: when something works, we freeze it into a default and move the argument elsewhere.
Attention gets the drama. Routing gets the diagrams. Context windows get the product demos. Meanwhile, the feedforward network sits there, quietly holding a large share of the parameters and applying the same nonlinearity to every token, every time, as if “one curve fits all” were a law of nature rather than a convenient engineering choice.
The paper “More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations” asks a simple question with inconvenient implications: what if the FFN should not use one fixed activation function at all? [1]
The proposed answer is Mixture of Activations, or MoA. The name sounds dangerously close to Mixture of Experts, so let us kill the wrong interpretation early. MoA is not another sparse-expert routing trick. It does not send a token to different parameterized experts. It keeps the same linear projections and changes something smaller but more basic: which activation functions are mixed for each token.
That distinction is the paper’s real contribution. MoE says: different tokens may need different parameter blocks. MoA says: before you allocate another expert, perhaps different tokens merely need different nonlinear transformations inside the same FFN. Less glamorous, yes. Also harder to dismiss.
The usual FFN makes one activation function do everyone’s job
A standard Transformer FFN takes a token representation, projects it into a wider hidden space, applies a nonlinear activation, and projects it back. In the simpler Type-I form, it looks like:
$$ f(x) = W_2 \sigma(W_1x) $$
The more modern Type-II form, used by SwiGLU-style FFNs, introduces a multiplicative gate:
$$ g(x) = W_3(\sigma(W_1x) \odot W_2x) $$
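The two forms can be sketched in a few lines of plain Python. This is a toy illustration, not the paper's code: the helper names (`matvec`, `ffn_type1`, `ffn_type2`), the tiny weight matrices, and the choice of ReLU and SiLU as $\sigma$ are all assumptions made for the example.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(z):
    return max(z, 0.0)

def silu(z):
    return z / (1.0 + math.exp(-z))

def ffn_type1(x, W1, W2, act=relu):
    """Type-I: f(x) = W2 act(W1 x), with act applied elementwise."""
    hidden = [act(h) for h in matvec(W1, x)]
    return matvec(W2, hidden)

def ffn_type2(x, W1, W2, W3, act=silu):
    """Type-II: g(x) = W3 (act(W1 x) * W2 x), a gated (GLU-style) FFN."""
    gate = [act(h) for h in matvec(W1, x)]
    value = matvec(W2, x)
    return matvec(W3, [g * v for g, v in zip(gate, value)])

# Toy shapes: model dimension 2, hidden width 3. Values are arbitrary.
W_up = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
W_val = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
W_down = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]

x = [2.0, -1.0]
y1 = ffn_type1(x, W_up, W_down)
y2 = ffn_type2(x, W_up, W_val, W_down)
```

In both functions the activation is a fixed argument chosen once. That is the design choice the rest of the article questions.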
The architectural history is familiar: ReLU gave way to GELU, SwiGLU, ReLU² variants, and other activation choices. But most designs still make one activation function carry the whole layer. The model may learn billions of weights, but the nonlinear “shape” applied inside the FFN remains globally fixed.
That is a little strange. Tokens are not equivalent. A code token, a rare entity, a syntactic marker, and a numerical fragment may not benefit from the same local transformation. Yet the FFN activation is usually treated as a static design choice, selected before training and shared across inputs.
The paper’s first move is not MoA. It is the more modest Learnable Activations, or LA.
Instead of using one activation function, LA learns a linear combination of several candidate activations. For Type-I FFNs, this means replacing $\sigma(W_1x)$ with something like:
$$ \sum_k \alpha_k \sigma_k(W_1x) $$
where the $\sigma_k$ functions come from a small dictionary such as GELU, SiLU, ReLU, ReLU², LeakyReLU, and tanh.
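A minimal sketch of that blend on a single pre-activation value, assuming the six-function dictionary named above. The LeakyReLU slope of 0.01 and the coefficient values are hypothetical; in a real LA layer the $\alpha_k$ would be learned.

```python
import math

def gelu(z):
    # Exact GELU via the Gaussian CDF.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def silu(z):
    return z / (1.0 + math.exp(-z))

def relu(z):
    return max(z, 0.0)

def relu_sq(z):
    return max(z, 0.0) ** 2

def leaky_relu(z, slope=0.01):  # slope is an assumed default
    return z if z > 0 else slope * z

DICTIONARY = [gelu, silu, relu, relu_sq, leaky_relu, math.tanh]

def learnable_activation(z, alphas):
    """LA: sum_k alpha_k * sigma_k(z), with the same alphas for every token."""
    return sum(a * sigma(z) for a, sigma in zip(alphas, DICTIONARY))

# Hypothetical trained coefficients: equal parts ReLU and ReLU^2, nothing else.
alphas = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
values = [learnable_activation(z, alphas) for z in (-1.0, 0.5, 2.0)]
print(values)
```

Note what `learnable_activation` does not take as an argument: the token. Whatever blend training settles on, every input gets the same one.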
This already improves the design. The FFN no longer has to pretend that one handcrafted curve is always enough. But LA still has a serious limitation: the coefficients $\alpha_k$ are fixed after training. Every token receives the same activation blend.
So LA changes the recipe, but not per customer. MoA changes the recipe per token.
MoA is token-adaptive nonlinear mixing, not expert routing
MoA replaces fixed activation-mixing coefficients with input-dependent gates. Instead of learning one global activation mixture, the layer computes mixing weights from the token representation itself.
Conceptually:
| Design | What varies during inference | What stays shared | Core limitation or advantage |
|---|---|---|---|
| Fixed activation FFN | Nothing about the activation choice | Linear projections and activation function | One nonlinear form for all tokens |
| Learnable Activations | Nothing: the activation blend is learned in training, then fixed | Linear projections and the learned blend | Better global curve, still input-independent |
| Mixture of Activations | Activation blend varies by token | Linear projections | Token-adaptive nonlinearity with small architectural change |
| Mixture of Experts | Expert selection varies by token | Often not the expert parameters | Higher capacity, but heavier routing and expert-management complexity |
This is the key mechanism. MoA does not ask, “Which expert should process this token?” It asks, “Which nonlinear shape should this token use inside the FFN?”
That is why soft gating makes sense here. In a typical MoE layer, hard or sparse routing matters because expert networks contain separate parameters and routing determines compute cost. In MoA, the activation dictionary is small and the linear projections are shared. The paper therefore uses soft gates by default: all activation candidates may contribute, with token-dependent weights.
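The soft-gating idea can be sketched for a Type-I layer as follows. The gate form used here, one scalar $\mathrm{sigmoid}(v_k \cdot x)$ per activation, is an assumption for illustration; the paper's exact gate parametrization may differ. The three-function dictionary and all weights are also hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(z, 0.0)

def relu_sq(z):
    return max(z, 0.0) ** 2

ACTIVATIONS = [relu, relu_sq, math.tanh]  # small illustrative dictionary

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def moa_gates(x, gate_vectors):
    """One soft gate per activation, computed from the token itself.
    Assumed form: gate_k(x) = sigmoid(v_k . x). Gates are not normalized."""
    return [sigmoid(dot(v, x)) for v in gate_vectors]

def moa_hidden(x, W1, gate_vectors):
    """Hidden units: sum_k gate_k(x) * sigma_k(W1 x), shared W1 for all tokens."""
    pre = [dot(row, x) for row in W1]
    gates = moa_gates(x, gate_vectors)
    return [sum(g * act(h) for g, act in zip(gates, ACTIVATIONS)) for h in pre]

def moa_type1(x, W1, W2, gate_vectors):
    hidden = moa_hidden(x, W1, gate_vectors)
    return [dot(row, hidden) for row in W2]

W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 1.0]]
V = [[4.0, 0.0], [-4.0, 0.0], [0.0, 0.0]]  # hypothetical gate weights

token_a = [2.0, 2.0]
token_b = [-2.0, 2.0]
gates_a, gates_b = moa_gates(token_a, V), moa_gates(token_b, V)
hidden_a, hidden_b = moa_hidden(token_a, W1, V), moa_hidden(token_b, W1, V)
out_a, out_b = moa_type1(token_a, W1, W2, V), moa_type1(token_b, W1, W2, V)
```

The second hidden unit sees the same pre-activation (2.0) for both tokens, yet `hidden_a[1]` and `hidden_b[1]` differ because the first token leans on ReLU and the second on ReLU². Same projection, different curve: that is the whole mechanism.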
For Type-II FFNs, the authors test several variants:
| Variant | Mechanism | Practical reading |
|---|---|---|
| One-sided MoA | Keeps one SwiGLU-style branch fixed and makes the other branch activation-adaptive | Lowest conceptual disturbance |
| Bi-sided MoA | Applies adaptive activation mixtures to both branches | More flexible, but not always best at scale |
| Quadratic MoA | Mixes pairwise activation products with input-dependent coefficients | Richer nonlinear interaction, potentially more expressive |
The point is not that every variant should become production default. The point is that activation design becomes a conditional computation problem at the nonlinear layer, not only at the expert layer. That is a quieter architectural shift than MoE, which is exactly why it is easy to miss.
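The quadratic variant is the least obvious of the three, so here is a rough sketch on a single pair of scalar pre-activations. It assumes a four-function dictionary that includes the identity, so that a plain SwiGLU unit falls out as a special case; that dictionary, the helper names, and the coefficient values are illustrative choices, not the paper's configuration. In an actual quadratic MoA layer the coefficients would be computed from the token rather than passed in by hand.

```python
import math

def silu(z):
    return z / (1.0 + math.exp(-z))

def relu(z):
    return max(z, 0.0)

def identity(z):
    return z

# Assumed dictionary; identity is included so SwiGLU is recoverable.
DICTIONARY = [silu, relu, math.tanh, identity]
N = len(DICTIONARY)

def quadratic_mix(a, b, coeffs):
    """sum_{j,k} coeffs[j][k] * sigma_j(a) * sigma_k(b) for scalars a, b."""
    return sum(
        coeffs[j][k] * DICTIONARY[j](a) * DICTIONARY[k](b)
        for j in range(N)
        for k in range(N)
    )

def sparse_coeffs(entries):
    """Build an N x N coefficient matrix from {(j, k): weight}."""
    return [[entries.get((j, k), 0.0) for k in range(N)] for j in range(N)]

a, b = 1.5, -0.7  # stand-ins for one coordinate of W1 x and W2 x

# All weight on (SiLU, identity) reproduces a SwiGLU unit.
swiglu_unit = silu(a) * b
as_special_case = quadratic_mix(a, b, sparse_coeffs({(0, 3): 1.0}))

# A blended choice adds a second pairwise product to the same unit.
blended = quadratic_mix(a, b, sparse_coeffs({(0, 3): 0.5, (1, 2): 0.5}))
```

Under these assumptions, one-sided and bi-sided mixing correspond to restricting which rows or columns of the coefficient matrix are allowed to be nonzero.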
The theory says the gain is not just decoration
The paper’s theoretical contribution is finite-width expressive separation. This matters because the obvious objection is: neural networks are universal approximators, so why care about one activation mixture versus another?
Because “universal approximation” often hides the bill in the required width. If a function can be represented only by growing width, it may be theoretically possible but operationally expensive. The paper therefore compares fixed-activation FFNs, LA, and MoA at the same finite width.
For Type-I FFNs, the paper proves a strict hierarchy:
$$ \bigcup_{\sigma \in K} \mathcal{F}^{(m)}_{\sigma} \subsetneq \mathcal{F}^{(m)}_{\mathrm{LA}} \subsetneq \mathcal{F}^{(m)}_{\mathrm{MoA}} $$
The first separation is intuitive. LA can combine activation primitives, so it can represent a function such as:
$$ T_{\mathrm{LA}}(x) = \mathrm{ReLU}(x_1) + \mathrm{ReLU}^2(x_1) $$
with a single learned activation mixture. A fixed-activation FFN cannot reproduce both local behaviors at the same finite width. ReLU-like networks have piecewise-constant derivatives; smooth activations such as GELU, SiLU, and tanh do not reproduce ReLU’s derivative discontinuity. In plain English: one curve cannot cheaply mimic every useful kink and curvature at once. Absolutely no one in numerical engineering should be shocked.
The second separation is more important. MoA can represent token-adaptive modulation such as:
$$ T_{\mathrm{MoA}}(x) = \tanh(\lambda x_1)\mathrm{ReLU}(x_2) $$
Here one coordinate modulates the amplitude of a feature created in another coordinate. LA cannot do this at fixed width because its activation weights are constants. MoA can, because the activation mixture depends on the input.
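The "MoA can" half of that claim is easy to check numerically. The sketch below assumes a width-1 unit with a tanh gate of the form $\tanh(v \cdot x)$ multiplying $\mathrm{ReLU}(w \cdot x)$, and an arbitrary $\lambda = 3$. It only verifies representation; the impossibility result for LA is the paper's theorem and is not re-derived here.

```python
import math

LAMBDA = 3.0  # arbitrary choice for the constant lambda

def relu(z):
    return max(z, 0.0)

def target(x1, x2):
    """Witness function T_MoA(x) = tanh(lambda * x1) * ReLU(x2)."""
    return math.tanh(LAMBDA * x1) * relu(x2)

def moa_unit(x1, x2, v=(LAMBDA, 0.0), w=(0.0, 1.0)):
    """Width-1 MoA unit with an assumed tanh gate: tanh(v.x) * ReLU(w.x)."""
    gate = math.tanh(v[0] * x1 + v[1] * x2)
    return gate * relu(w[0] * x1 + w[1] * x2)

grid = [(-1.0, 2.0), (0.0, 2.0), (0.5, 2.0), (0.5, -1.0), (1.0, 0.3)]
max_gap = max(abs(target(a, b) - moa_unit(a, b)) for a, b in grid)

# Holding x2 fixed, x1 alone flips the sign of the output.
flipped = (target(-1.0, 2.0), target(0.5, 2.0))
```

With the gate reading $x_1$ and the activation reading $x_2$, a single hidden unit matches the target on every grid point. A constant coefficient has no way to read $x_1$ at all.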
That is the paper’s core mechanism in one sentence: LA learns a better global nonlinearity; MoA learns when to use which nonlinearity.
The Type-II theory repeats the hierarchy under a more demanding setting. This is important because Type-II FFNs already contain multiplicative structure. The authors still show strict separation: fixed Type-II activation pairs are contained in quadratic LA, which is contained in quadratic MoA. The witness functions involve input-adaptive three-factor interactions, such as:
$$ \mathrm{ReLU}(x_2)\mathrm{ReLU}(x_1)\tanh(x_1) $$
The message is not that these toy functions are business cases. Please do not build a product roadmap around $\tanh(\lambda x_1)\mathrm{ReLU}(x_2)$. The message is architectural: input-dependent activation mixing creates representational behavior that fixed or globally learned activation mixtures cannot match at the same width.
That is a clean theoretical reason to expect MoA to help before the experiments begin.
The ablations are design selection, not the main victory lap
The paper’s first experimental block is a design study on a 0.12B dense model. Its purpose is not to prove final scaling. It is an ablation stage: choose activation dictionaries, compare LA and MoA variants, and decide which MoA designs deserve larger runs.
The activation dictionary uses abbreviations: $g$ for GELU, $s$ for SiLU, $r^2$ for ReLU², $l$ for LeakyReLU, $t$ for tanh, and $r$ for ReLU.
The headline from Table 1 is mixed in exactly the way useful ablations often are:
| 0.12B design study | Likely purpose | What it shows | What it does not prove |
|---|---|---|---|
| Activation-dictionary ablation | Ablation | Type-I MoA improves over the ReLU² baseline; Type-I LA slightly underperforms. Type-II LA and MoA variants all improve over the SwiGLU baseline in this setting. | That every activation dictionary will transfer to every scale |
| Gating-function ablation | Robustness/sensitivity test | Sigmoid gating is best in the tested Type-I and Type-II MoA settings; tanh gating is weak in Type-I despite being sufficient for theory. | That sigmoid is universally optimal |
| Final 0.12B comparison | Main design selection | MoA performs best for both Type-I and Type-II FFNs; Type-II MoA slightly leads overall. | That the 0.12B winner is automatically best at 2B or production scale |
The numbers are worth reading carefully.
For Type-I on the 0.12B dense model, LA with the selected dictionary is slightly worse than baseline, with relative loss $+0.003$, while MoA improves the baseline with relative loss $-0.015$. For Type-II, the SwiGLU baseline is improved by one-sided LA, bi-sided LA, qd-LA, one-MoA, bi-MoA, and qd-MoA in the reported ablation. The strongest Type-II result in the gating ablation is bi-MoA with sigmoid gates, reaching relative loss $-0.029$.
The gating result is especially useful because it separates expressivity from optimization. The theory uses tanh gates for proof. The experiments find sigmoid gates train better in the tested setting. The authors explicitly note that theory establishes expressive separation, not optimization dynamics. Near initialization, $\tanh(0)=0$, while $\mathrm{sigmoid}(0)=1/2$, which may give sigmoid stronger initial signal propagation.
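The initialization point can be made concrete with one gated hidden unit. The sketch assumes the gate logit is exactly zero at the start of training, which is a simplification; the pre-activation value is arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(z, 0.0)

gate_logit = 0.0      # assumed: gate weights start at (or near) zero
pre_activation = 1.2  # arbitrary hidden pre-activation

# What the gated unit passes forward under each gate function.
tanh_gated = math.tanh(gate_logit) * relu(pre_activation)
sigmoid_gated = sigmoid(gate_logit) * relu(pre_activation)

print(f"tanh gate:    {tanh_gated:.2f}")
print(f"sigmoid gate: {sigmoid_gated:.2f}")
```

Under this assumption the tanh-gated unit passes nothing forward at step zero, while the sigmoid-gated unit passes half of the activation. That is consistent with the signal-propagation explanation above, though it is a toy check rather than evidence about training dynamics.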
That is a good example of a paper not confusing “can represent” with “will train well.” A rare courtesy. Enjoy it.
Dense-model results show lower loss and higher learning-rate tolerance
The main dense language-model experiments compare Type-II MoA variants against a SwiGLU-based Llama baseline. The models include 0.12B and 0.25B dense settings trained with 20 tokens per parameter, plus a 0.12B setting trained with 100 tokens per parameter.
The paper reports two main findings.
First, MoA variants consistently achieve lower terminal validation loss than the tuned Llama baseline across the dense settings.
Second, MoA variants tolerate larger peak learning rates. For the 0.12B model trained at 20 tokens per parameter, the baseline’s best peak learning rate is $2 \times 10^{-3}$, while all MoA variants use $3 \times 10^{-3}$. For the 0.12B model trained at 100 tokens per parameter, the baseline uses $4 \times 10^{-3}$, while one-MoA and qd-MoA use $5 \times 10^{-3}$, and bi-MoA uses $6 \times 10^{-3}$.
This matters because “lower loss” can come from many sources: extra parameters, lucky tuning, unstable comparisons, or architecture. The learning-rate tolerance result suggests MoA may also alter the training landscape, not merely enlarge the function class.
But this is still a controlled pre-training finding, not a product KPI. Lower validation loss is encouraging. It is not the same as lower serving cost, higher retention, better coding accuracy, or improved safety behavior. Architecture papers tend to whisper that distinction. Product teams should read it in bold.
MoE results test whether MoA still helps when routing already exists
The paper’s MoE experiments are the more interesting business signal because they test MoA in a setting where conditional computation already exists.
The models are LlamaMoE variants from 0.25B to 2B total parameters, with activated parameter sizes from 0.11B to 0.62B. Each uses 32 sparse experts, activates 4 sparse experts per token, and includes one shared expert. The authors use the Muon optimizer and warmup-stable-decay schedules to build strong baselines. Importantly, for the larger MoE experiments, they tune the baseline learning rate first and reuse the same learning rate for the MoA variants, rather than giving MoA a separate tuning advantage.
That experimental choice matters. It makes the MoE evidence less easy to dismiss as “the new method got more tuning love.”
The result: one-MoA and qd-MoA consistently achieve lower terminal loss than the Muon-trained LlamaMoE baselines across the tested model sizes. The paper says gains exceed 0.01 in most experiments, and the scaling-law panel shows the performance gap remains stable across sizes, with one-MoA and baseline curves nearly parallel.
The interpretation should be precise. This does not prove MoA will keep improving indefinitely. It does suggest that MoA is not merely a small-dense-model curiosity. More importantly, it suggests MoA can complement MoE instead of replacing it.
That is the architectural punchline: routing tokens to experts and adapting activation mixtures are separate levers.
A sparse MoE layer changes where the token goes. MoA changes the nonlinear shape applied when it gets there. If both help, then “conditional computation” in LLMs may be too narrowly framed when it only means parameter routing.
Zero-shot scores improve, but the evidence is modest
The paper includes a downstream zero-shot evaluation on LlamaMoE-2B using ARC-C, HellaSwag, OpenBookQA, and WinoGrande.
| Benchmark | LlamaMoE | one-MoA | qd-MoA |
|---|---|---|---|
| ARC-C | 36.50 | 36.86 | 36.52 |
| HellaSwag | 43.31 | 44.54 | 44.91 |
| OpenBookQA | 30.20 | 29.80 | 30.20 |
| WinoGrande | 58.80 | 60.22 | 60.22 |
| Average | 42.20 | 42.86 | 42.96 |
The average improvement is real but not dramatic: one-MoA improves from 42.20 to 42.86, and qd-MoA to 42.96. The task-level picture is uneven. qd-MoA ties the baseline on OpenBookQA; one-MoA is slightly lower there.
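The averages and per-task differences can be recomputed directly from the table. The snippet uses only the reported scores; the variable names are illustrative.

```python
# Zero-shot scores as reported in the table above.
scores = {
    "LlamaMoE": {"ARC-C": 36.50, "HellaSwag": 43.31, "OpenBookQA": 30.20, "WinoGrande": 58.80},
    "one-MoA": {"ARC-C": 36.86, "HellaSwag": 44.54, "OpenBookQA": 29.80, "WinoGrande": 60.22},
    "qd-MoA": {"ARC-C": 36.52, "HellaSwag": 44.91, "OpenBookQA": 30.20, "WinoGrande": 60.22},
}

# Unweighted mean over the four benchmarks.
averages = {model: sum(s.values()) / len(s) for model, s in scores.items()}

# Per-task change relative to the LlamaMoE baseline.
baseline = scores["LlamaMoE"]
deltas = {
    model: {task: round(s[task] - baseline[task], 2) for task in s}
    for model, s in scores.items()
    if model != "LlamaMoE"
}

for model, avg in averages.items():
    print(f"{model:9s} average = {avg:.2f}")
for model, d in deltas.items():
    print(model, d)
```

The per-task deltas show where the average comes from: most of the gain sits in HellaSwag and WinoGrande, while ARC-C moves by less than half a point and OpenBookQA is flat or slightly negative.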
So the downstream evidence supports the pre-training story, but it does not turn MoA into a magic benchmark machine. It says: lower pre-training loss appears to carry through to modest average zero-shot gains in this tested 2B MoE setting. That is useful. It is not a coronation.
For business readers, this is exactly the kind of result that should trigger engineering evaluation, not procurement enthusiasm. If your team trains or adapts LLM architectures, MoA deserves a place in the experiment queue. If you buy API access to frontier models, this paper is upstream architecture intelligence, not an immediate purchasing criterion.
The overhead story is where MoA becomes operationally interesting
The strongest business argument is not “MoA improves validation loss.” Many things improve validation loss if you are willing to pay enough.
The more interesting claim is that MoA improves loss with small parameter and memory overhead.
The paper’s parameter argument is straightforward. The activation dictionary size is small and independent of the Transformer hidden dimension $d$. The extra MoA parameters scale as $O(d)$, while the dominant Transformer parameter terms scale as $O(d^2)$. That asymmetry is the reason MoA can be a low-intrusion upgrade rather than a new capacity regime.
The authors also run a parameter-controlled ablation. The largest MoA variant, bi-MoA, has 0.11980B parameters, compared with 0.11974B for the Llama baseline. They increase the baseline FFN hidden dimension to match the parameter count and call it Llama-large. Result: MoA reduces terminal loss by 0.029, while the parameter-matched Llama-large yields almost no gain.
This test is important because it blocks the lazy explanation: “Maybe the method just has more parameters.” In this experiment, the answer is mostly no. The gain appears tied to the mechanism, not merely the count.
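A back-of-envelope check puts the parameter gap in perspective. The first half uses only the two reported counts. The second half is a hypothetical scaling illustration: `gate_share` assumes a handful of gate vectors of size $d$ next to three $d \times 4d$ FFN projections, which is an assumption about the parametrization and not the paper's exact accounting.

```python
# Parameter counts as reported above.
baseline_params = 0.11974e9  # Llama baseline
bi_moa_params = 0.11980e9    # bi-MoA, the largest MoA variant

extra_params = bi_moa_params - baseline_params
relative_increase = extra_params / baseline_params

def gate_share(d, num_activations=6, expansion=4):
    """Hypothetical ratio of gate parameters to gated-FFN parameters.
    Assumes num_activations gate vectors of size d (O(d)) and three
    d x (expansion * d) projections (O(d^2))."""
    gate_params = num_activations * d
    ffn_params = 3 * d * (expansion * d)
    return gate_params / ffn_params

shares = {d: gate_share(d) for d in (768, 2048, 4096)}

print(f"extra parameters: {extra_params:,.0f} ({relative_increase:.3%})")
for d, share in shares.items():
    print(f"d = {d}: assumed gate share = {share:.4%}")
```

The reported gap is about sixty thousand parameters on a base of roughly 120 million. Under the assumed parametrization, the gate share also shrinks as $1/d$, which is the $O(d)$ versus $O(d^2)$ asymmetry in numeric form.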
Runtime overhead is not zero, though. On a Dense-0.5B model with torch.compile, the paper reports:
| Model | Wall-clock time | Memory usage |
|---|---|---|
| Type-I baseline | 190 ms | 26,571 MiB |
| Type-I MoA | 196 ms (1.03×) | 26,627 MiB (1.00×) |
| Type-II baseline | 196 ms | 28,759 MiB |
| Type-II MoA | 222 ms (1.13×) | 28,873 MiB (1.00×) |
This is a practical trade-off. Type-I MoA looks cheap in this measurement. Type-II MoA is noticeably slower, though still far from a catastrophic redesign. Memory remains nearly unchanged.
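The ratios in the table follow directly from the raw measurements, and the absolute memory difference is worth seeing next to the rounded 1.00×. The snippet uses only the reported numbers; the dictionary layout is an illustrative choice.

```python
# (wall-clock ms, memory MiB) as reported for Dense-0.5B.
measurements = {
    "Type-I": {"baseline": (190, 26571), "moa": (196, 26627)},
    "Type-II": {"baseline": (196, 28759), "moa": (222, 28873)},
}

ratios = {}
for name, m in measurements.items():
    (t0, mem0), (t1, mem1) = m["baseline"], m["moa"]
    ratios[name] = {
        "time": t1 / t0,            # slowdown factor
        "memory": mem1 / mem0,      # memory growth factor
        "extra_mib": mem1 - mem0,   # absolute memory added
    }

for name, r in ratios.items():
    print(f"{name}: time x{r['time']:.2f}, memory x{r['memory']:.3f} (+{r['extra_mib']} MiB)")
```

The memory cost is 56 MiB for Type-I and 114 MiB for Type-II, both well under half a percent. The time cost is where the two variants diverge.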
For training organizations, this means MoA is not “free,” but it is plausibly cheap enough to test. For inference-heavy deployment, the wall-clock increase matters more, especially if the architecture is used at high volume. A 13% overhead is not a rounding error when multiplied by millions of requests. It is also not a reason to ignore a mechanism that may reduce required model scale or training loss. Annoyingly, reality has more than one column.
The vision experiment is an exploratory extension, not a second thesis
The paper extends MoA beyond language by testing qd-MoA in self-supervised vision pre-training using a Masked Autoencoder setup with ViT-Base/16. The baseline uses a SwiGLU FFN; qd-MoA replaces that FFN while keeping the rest of the architecture unchanged. Models are pre-trained for 800 epochs with global batch size 4096 and mask ratio 0.75.
This experiment should be read as an exploratory extension and partial robustness check. It is not the center of the paper. Its purpose is to ask whether the mechanism is language-specific or more generally useful in pre-training.
The result is directionally consistent with the language experiments. Under the same learning rate, qd-MoA achieves lower validation reconstruction loss than the baseline. The best peak learning rate also rises from $3 \times 10^{-4}$ for the baseline to $1.2 \times 10^{-3}$ for qd-MoA.
This supports the claim that token-adaptive activation mixing is not limited to autoregressive text models. But it does not prove broad multimodal superiority, classification transfer, robustness, or downstream vision deployment value. It says the mechanism travels at least to one large-scale self-supervised vision setting. That is enough for a serious appendix-level signal.
What Cognaptus infers for business use
The paper directly shows four things.
First, MoA creates a strict finite-width expressive hierarchy over fixed activations and learnable activations. Second, MoA improves pre-training loss across several dense and MoE language-model settings. Third, it does so with small parameter and memory overhead, though with measurable wall-clock overhead. Fourth, the benefits appear in a vision pre-training test as well.
The business inference is narrower but useful: MoA is a candidate low-intrusion FFN upgrade for teams that control model architecture and training.
It is especially relevant in three cases.
| Business context | Why MoA is relevant | What to test before adoption |
|---|---|---|
| Training small or mid-scale proprietary LLMs | The paper’s evidence covers 0.12B to 2B models, close to many internal-model regimes | Validation loss, downstream task transfer, inference latency |
| Building MoE models | MoA appears complementary to expert routing, not redundant with it | Interaction with router balance, expert specialization, training stability |
| Optimizing architecture under parameter constraints | Parameter-controlled ablation suggests gains are not merely from more weights | Whether MoA beats alternative FFN changes under the same compute budget |
The most interesting operational possibility is not replacing MoE. It is adding a smaller conditional mechanism inside FFNs before escalating to heavier expert designs. In some systems, token-adaptive activation mixing may provide part of the benefit of conditional computation without requiring another layer of expert routing complexity.
That is an inference, not a result the paper proves. The paper does not benchmark MoA against every alternative FFN modification, every production kernel, every quantization regime, or every latency target. But it gives enough evidence to make MoA a serious experimental candidate.
Boundaries that matter before anyone gets too excited
The evidence is strongest for pre-training loss, not finished product performance. The zero-shot gains are positive but modest. There is no instruction-tuning study, no RLHF or preference-alignment analysis, no long-context evaluation, no code-specialized benchmark suite, and no serving-system cost analysis beyond the reported training overhead measurement.
Scale is another boundary. The paper goes up to 2B total parameters for MoE models. That is meaningful, but it is not frontier scale. The scaling curves are encouraging because the performance gap remains stable in the tested range, but extrapolation remains extrapolation. Even when dressed in log plots, it is still extrapolation.
Implementation also matters. MoA adds activation evaluations and gating operations. The theoretical overhead is small relative to FFN matrix multiplications, but real performance depends on kernels, compilation, batching, hardware, and inference stack design. Type-II MoA’s reported 1.13× wall-clock overhead on Dense-0.5B is acceptable in some training contexts and expensive in some production contexts.
Finally, MoA is a Part I paper. The conclusion itself points toward future work on distinct token-adaptive nonlinear mixing mechanisms across input dimensions. That signals the authors see this as a design family, not a fully settled endpoint.
The real message is that FFNs are still under-designed
The easiest way to summarize this paper is: MoA improves LLM pre-training by mixing activation functions per token.
That summary is accurate and insufficient.
The deeper message is that FFN layers still contain underexplored architectural degrees of freedom. We have spent years scaling parameters, routing experts, tuning attention variants, and stretching context. Meanwhile, the activation function inside the FFN has often remained a global default: pick one curve, train everything else.
MoA challenges that default. It says the nonlinear part of an FFN can be conditional without turning the whole layer into an expert router. It says learned activation mixtures are useful, but token-adaptive mixtures are more expressive. It says finite-width expressivity is not a decorative theorem if it predicts a mechanism that survives pre-training experiments.
For model builders, this is worth attention because the change is local. MoA does not demand a new training paradigm. It does not require separate expert weights. It does not ask the system to become an architectural circus. It inserts a small adaptive mechanism into a part of the model that already carries much of the nonlinear burden.
The practical conclusion is therefore boring in the best possible way: test it. Not because MoA is guaranteed to become the next default, but because the paper identifies a cheap architectural lever with theory, ablations, dense-model evidence, MoE evidence, overhead analysis, and a cross-domain sanity check.
In an industry addicted to making models larger, a paper that asks models to be slightly more selective about their nonlinearities feels almost restrained.
Dangerous stuff.
Cognaptus: Automate the Present, Incubate the Future.
[1] Mingze Wang, Jinbo Wang, Yikuan Xia, Kai Shen, and Shu Zhong, “More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations,” arXiv:2605.26647, 2026. https://arxiv.org/abs/2605.26647