The Experts Are Sparse Inside: Why MoE Cost Cuts Stop at 1.2x
Cost has a way of making architecture fashionable.
Mixture-of-Experts models became attractive because they promise a pleasant bargain: keep a large total parameter count, but activate only a small part of the model for each token. In business language, that sounds like capacity without the full compute bill. In engineering language, it means routing each token to a few expert feed-forward networks instead of running every expert all the time.
That bargain is real. It is also incomplete.
The paper “Uncovering Intra-expert Activation Sparsity for Efficient Mixture-of-Expert Model Execution” studies a less obvious source of savings: not skipping whole experts, but skipping low-impact neurons inside the experts that were already selected.[1] The result is more interesting than a simple “MoE gets faster” headline. The authors show that existing pretrained MoE models already contain substantial intra-expert activation sparsity, sometimes near 90% under their accuracy-retention criterion, without retraining or changing model parameters. They also show why this does not magically become a 10x serving improvement. The sparse computation still has to pass through routers, gate projections, GPU kernels, memory layouts, batching behavior, and Amdahl’s law. Hardware, as usual, declines to be impressed by a clean theoretical ratio.
The paper is therefore useful not because it says “sparsity is good.” That sentence has already had a long career. It is useful because it identifies where the unused computation sits, which parts of the MoE pipeline can safely exploit it, and where the speedup leaks away when the idea is placed inside vLLM rather than inside a paper diagram.
The saving is not only between experts
The usual MoE story is about inter-expert sparsity. A token arrives, a router scores available experts, and only a subset of those experts is used. The model avoids running the full feed-forward capacity for every token. This is the architectural trick behind much of the MoE efficiency story.
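For readers who want the routing step in concrete form, here is a minimal NumPy sketch of top-k expert selection. It is an illustration only: the function name `route_top_k`, the sizes, and the choice to apply softmax over just the selected experts are assumptions, and real models differ in how they normalize router weights.

```python
import numpy as np

def route_top_k(router_logits: np.ndarray, k: int):
    """Pick k experts per token and normalize their weights.

    router_logits: (num_tokens, num_experts) raw router scores.
    Returns (expert_ids, weights), each of shape (num_tokens, k).
    """
    # Indices of the k largest logits per token (order inside the k is arbitrary).
    expert_ids = np.argpartition(router_logits, -k, axis=-1)[:, -k:]
    picked = np.take_along_axis(router_logits, expert_ids, axis=-1)
    # Softmax over the selected experts only (one common convention, assumed here).
    picked = picked - picked.max(axis=-1, keepdims=True)
    weights = np.exp(picked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return expert_ids, weights

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))   # 4 tokens, 8 experts (illustrative sizes)
ids, w = route_top_k(logits, k=2)
```

Every expert not listed in `ids` for a token is simply never run for that token, which is the inter-expert saving.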
But once the router has selected an expert, the expert is often treated as a dense unit: run its feed-forward network, compute all intermediate neuron activations, and combine the result. The paper asks a simple follow-up question: when a token enters a selected expert, does it actually need the whole expert?
The answer appears to be no.
The authors define intra-expert activation sparsity as sparsity within the selected expert’s activation outputs. They take pretrained MoE models, sort each expert’s post-activation outputs, and zero out the lowest-scoring neurons. No model retraining. No architectural surgery. No new activation function. Just a test of whether many within-expert neuron activations contribute little enough that they can be skipped at inference time.
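The masking experiment can be sketched in a few lines. This is a toy reconstruction, not the authors' code: it assumes a gated feed-forward expert with a SiLU activation, ranks neurons by the magnitude of the activated gate output, and uses made-up dimensions. The name `sparse_expert_ffn` is hypothetical.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def sparse_expert_ffn(x, w_gate, w_up, w_down, sparsity):
    """Gated FFN for one token, dropping the smallest-magnitude activations.

    x: (d_model,), w_gate and w_up: (d_model, d_ff), w_down: (d_ff, d_model).
    sparsity: fraction of the d_ff neurons to zero out for this token.
    """
    act = silu(x @ w_gate)                      # post-activation gate output, (d_ff,)
    n_drop = int(sparsity * act.size)
    mask = np.ones(act.size, dtype=bool)
    if n_drop > 0:
        # Sort by magnitude and mask the weakest n_drop neurons.
        mask[np.argsort(np.abs(act))[:n_drop]] = False
    keep = np.flatnonzero(mask)
    # Only the kept neurons need their up and down weights.
    hidden = act[keep] * (x @ w_up[:, keep])
    return hidden @ w_down[keep, :], keep

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64                          # illustrative sizes
x = rng.normal(size=d_model)
w_gate = rng.normal(size=(d_model, d_ff))
w_up = rng.normal(size=(d_model, d_ff))
w_down = rng.normal(size=(d_ff, d_model))
y_sparse, kept = sparse_expert_ffn(x, w_gate, w_up, w_down, sparsity=0.9)
```

Note that the weights are untouched. The same expert serves the next token with a different set of kept neurons.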
This distinction matters because pushing MoE efficiency only through more expert granularity creates training problems. More experts can mean more routing instability, load imbalance, expert collapse, representation collapse, and fewer updates per expert. Intra-expert sparsity is different: it tries to extract additional inference efficiency from the model already trained, rather than demanding an even more fragile expert-routing structure during training.
The mechanism is easy to state:
- MoE routing skips unrelated experts.
- Intra-expert sparsity skips weak neurons inside the experts that remain.
- Production speedup depends on whether those skipped neurons correspond to computation the serving engine can actually avoid.
The third line is where the paper earns its keep.
What the paper actually tests
The study covers eight off-the-shelf MoE models, ranging from small MoE systems to models with roughly 400B total parameters. The tested set includes Granite-1B-A400M, OLMoE-1B-7B, DeepSeek-V2-Lite, GPT-OSS-20B, three Qwen3.5 MoE variants, and Llama-4-Maverick. The benchmarks are ARC-Challenge, ARC-Easy, HellaSwag, Winogrande, and TruthfulQA-mc2, evaluated through lm-eval-harness with a vLLM backend.
The representative cutoff is defined as the maximum sparsity at which a model retains 95% of its original average benchmark score. This is not a claim that all downstream tasks tolerate the same sparsity. It is a controlled paper-level rule for comparing models under a shared evaluation setup.
The headline finding is strong: larger MoE models tolerate much higher intra-expert sparsity. The smallest tested model, Granite-1B-A400M, has a sparsity cutoff of 26.5%. The largest tested model, Llama-4-Maverick, reaches 90.8%. The Qwen3.5-35B-A3B and Qwen3.5-122B-A10B models used in the vLLM performance experiments have sparsity cutoffs of 84.5% and 87.4%, respectively.
That is the scientific contribution. The operational contribution is narrower but more valuable: the authors do not stop at “we can zero many activations.” They integrate the sparse path into vLLM’s MoE execution and test whether the savings survive contact with GPU execution.
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Accuracy under increasing sparsity across eight MoE models | Main evidence | Existing pretrained MoE models contain substantial intra-expert activation sparsity | Every task, deployment, or model family can use the same sparsity threshold |
| Routed-only versus routed-plus-shared sparsity | Mechanism analysis | Shared expert neurons are more accuracy-sensitive than routed expert neurons | Shared experts can be ignored in all future sparse designs |
| Router-weight-based neuron budgeting | Exploratory extension | Allocating more neurons to high-router-weight experts gives only small gains in the tested setup | No better budgeting algorithm could matter |
| Activation histogram and per-neuron activation count | Mechanism evidence | Within one expert, activation magnitudes are long-tailed and unevenly used | Neuron pruning is immediately safe without separate validation |
| vLLM sparse execution benchmarks | Implementation evidence | Sparse up/down expert computation can accelerate MoE layer execution | Layer speedup equals end-to-end serving speedup |
| End-to-end vLLM evaluation | Practical boundary | System-level gains are real but modest, topping out around 1.1x to 1.2x in the reported setup | All production workloads will see the same gains |
This table is important because otherwise the paper is easy to misread. The accuracy experiments establish that sparsity exists. The vLLM experiments establish that some of it can be exploited. The end-to-end results establish that production value is constrained by everything else in the serving stack.
Shared experts are not spare parts
The paper’s most useful correction is that not all “expert” computation is equally disposable.
Some MoE architectures include shared experts in addition to routed experts. Routed experts are selected differently for different tokens. Shared experts are always involved. In the experiments, applying sparsity only to routed experts produces stronger accuracy retention than applying sparsity to both routed and shared experts. The authors observe this repeatedly in models with shared experts.
This makes intuitive sense after the fact. Shared experts behave less like optional specialists and more like common infrastructure. If a routed expert is a department called in for a specific case, the shared expert is closer to the office network. You can reduce specialist effort aggressively; you should not unplug the building.
The large Qwen3.5 models illustrate the point. With shared experts left fully dense, the paper reports that at the cutoff, fewer than 2% of the neurons in each token's active routed experts can be enough to retain model capacity in those large models. That number is striking, but it should not be converted into a naive “98% of MoE computation is waste” claim. The result depends on shared experts staying intact, on the paper’s benchmark set, and on the 95% average-score retention rule.
The business implication is straightforward. If an inference team wants to explore activation sparsity in MoE serving, the first target should not be “sparsify everything.” It should be routed expert execution, with shared expert computation treated more carefully. A blunt sparsity policy is cheaper to explain and easier to ruin.
Router scores are useful, but not a full neuron budget
A tempting idea is to use router weights to decide not only which experts to activate, but how many neurons inside each active expert deserve computation. Higher router weight, more neuron budget. Lower router weight, fewer neurons.
The paper tests a simple version of that idea by allocating neuron budgets across active routed experts according to router-weight groups. The result is modest: giving more neurons to higher-priority routed experts improves average accuracy, but only up to about two percentage points in the reported setup.
This is best read as an exploratory extension, not the paper’s main thesis. It shows that router weights carry some useful information for within-expert allocation, but the tested allocation rule is not powerful enough to justify inclusion in the end-to-end system. The authors therefore leave more sophisticated neuron budgeting, such as water-filling-like strategies, for future work.
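To make the budgeting idea tangible, here is one simple way such an allocation could work: split a fixed neuron budget across a token's active experts in proportion to router weight. This is explicitly not the paper's grouping rule, whose details are not given here. The function `allocate_neuron_budget` and all numbers are hypothetical.

```python
def allocate_neuron_budget(router_weights, total_budget, d_ff):
    """Split total_budget kept neurons across experts by router weight.

    router_weights: positive weights of the active routed experts for one token.
    total_budget: total number of neurons to keep across those experts.
    d_ff: neurons per expert (no expert can receive more than this).
    """
    total_w = sum(router_weights)
    # Proportional share, rounded down and capped at the expert size.
    budgets = [min(d_ff, int(total_budget * w / total_w)) for w in router_weights]
    target = min(total_budget, d_ff * len(router_weights))
    leftover = target - sum(budgets)
    # Hand out what rounding or capping left over, highest router weight first.
    order = sorted(range(len(router_weights)), key=lambda i: -router_weights[i])
    while leftover > 0:
        for i in order:
            if leftover == 0:
                break
            if budgets[i] < d_ff:
                budgets[i] += 1
                leftover -= 1
    return budgets

budgets = allocate_neuron_budget([0.5, 0.3, 0.2], total_budget=300, d_ff=512)
```

Even a rule this small adds per-token bookkeeping before any kernel runs, which is part of why a two-point accuracy gain may not pay for itself.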
For practitioners, this is a useful restraint. There is a difference between a signal and a product feature. Router weights may help, but a serving system also needs predictable overhead, simple control parameters, and kernels that map cleanly to hardware. A clever allocation rule that saves a few marginal activations but complicates the execution path may be an academic improvement and an operational nuisance. Many production systems already have enough of those.
The inactive neurons are not evenly inactive
The paper also profiles activations within a single expert: expert 0, layer 0 of Qwen3.5-35B-A3B, using WikiText-2 samples at 95% sparsity. The observed activation distribution is long-tailed. The authors report that 54.85% of the activation outputs fall in a near-zero bin between -0.003 and 0.003. At 95% sparsity, the masked values are concentrated in a relatively narrow activation range, while larger-magnitude activations remain active.
The per-neuron count is also uneven. Fewer than 10 neurons receive more than 2,000 activations, about 8.5 times the average count of 235.12. Meanwhile, 284 neurons are never activated in that profile.
This evidence does two jobs.
First, it supports the immediate inference-time idea: many within-expert activations are weak enough that skipping them can preserve benchmark performance under the paper’s thresholding rule. Second, it hints at a separate model-compression possibility: if some neurons are consistently inactive, future work might prune them to reduce model size, not only skip their computation dynamically.
The second point should remain a hint. Dynamic activation sparsity and structural pruning are not the same operation. Skipping low activations for a token is reversible at the next token. Removing neurons from the model is not. The paper mentions pruning as an opportunity, not as a completed result.
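A profile of this kind is straightforward to compute once activations are captured. The sketch below uses synthetic data in place of real expert activations, so its numbers mean nothing. It only shows which statistics are being counted. The function `profile_expert` and the lognormal per-neuron scales are assumptions made for illustration.

```python
import numpy as np

def profile_expert(acts, threshold, near_zero=0.003):
    """Summarize how unevenly one expert's neurons fire.

    acts: (num_tokens, d_ff) post-activation values for a single expert.
    threshold: magnitude cutoff; values below it count as masked.
    """
    mags = np.abs(acts)
    active = mags >= threshold            # activations that survive masking
    counts = active.sum(axis=0)           # how often each neuron stays active
    return {
        "near_zero_fraction": float((mags < near_zero).mean()),
        "mean_count": float(counts.mean()),
        "max_count": int(counts.max()),
        "never_active": int((counts == 0).sum()),
    }

# Synthetic stand-in: per-neuron scales make some neurons far busier than others.
rng = np.random.default_rng(0)
scales = rng.lognormal(mean=-4.0, sigma=1.5, size=256)
acts = rng.normal(size=(2000, 256)) * scales
threshold = np.quantile(np.abs(acts), 0.95)   # cutoff that masks 95% of values
stats = profile_expert(acts, threshold)
```

On real data, a large `never_active` count is the pruning hint, while a `max_count` far above `mean_count` is the long tail.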
vLLM integration: the part where sparsity meets machinery
The authors implement intra-expert activation sparsity inside vLLM by modifying only the routed per-expert execution path. They do not modify the shared expert execution, router, dispatch, or combine operations. That design choice is not laziness. It is the difference between a research prototype and a change that can coexist with existing serving optimizations.
The sparse path has three main components:
- a standalone gate projection;
- a custom sparse activation kernel using threshold-based masking;
- a fused sparse up-down projection kernel.
The key limitation appears immediately. Gate projection must run before the system knows which neurons are inactive, because the gate output is needed to compute the activation mask. In the implementation, gate projection remains dense. That means sparsity applies mainly to the up and down projections, roughly two-thirds of the expert feed-forward computation. Even if those two projections cost nothing, the dense gate projection would still take about one-third of the original time, so the theoretical maximum speedup for the expert feed-forward computation is bounded around 3x even before other system costs appear.
The thresholding design is also practical. Instead of using a top-k operation to select active neurons, the system converts the user-specified sparsity target into a model-specific threshold at engine startup. A neuron is masked out if the absolute activation value falls below that threshold. This allows single-pass masking and keeps the runtime path simpler.
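The threshold approach can be sketched as a one-time calibration followed by a cheap comparison at runtime. This is a guess at the general shape, not the vLLM integration itself: the calibration-by-quantile step, the function names, and the random calibration data are all assumptions.

```python
import numpy as np

def calibrate_threshold(calib_acts, target_sparsity):
    """Magnitude below which target_sparsity of calibration activations fall.

    Intended to run once at startup on sample activations (assumed available).
    """
    return float(np.quantile(np.abs(calib_acts), target_sparsity))

def threshold_mask(acts, threshold):
    """Single-pass mask: True where a neuron stays active."""
    return np.abs(acts) >= threshold

rng = np.random.default_rng(1)
calib = rng.standard_normal((512, 256))      # stand-in for captured activations
thr = calibrate_threshold(calib, target_sparsity=0.85)

fresh = rng.standard_normal((8, 256))        # activations seen at serving time
mask = threshold_mask(fresh, thr)
achieved_per_token = 1.0 - mask.mean(axis=1) # varies token by token
```

Unlike top-k, a fixed threshold does not guarantee the same number of active neurons per token, so achieved sparsity drifts around the target. That is one reason a target and an achieved figure can differ slightly.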
Then comes the GPU problem. Sparse computation is irregular. Dense GPU kernels are fast partly because they enjoy contiguous memory access, high utilization, and predictable shapes. A sparse kernel that gathers scattered active neurons can save arithmetic while losing some of the hardware regularity that made the dense baseline efficient.
The authors address this by packing active neuron weights into dense blocks in shared memory and by preserving compatibility with HIP Graph and CUDA Graph. They also use a fixed tile granularity of 64 neurons and switch dynamically between sparse and dense execution depending on batch size. When the batch is small enough, the sparse low-latency path helps. When the batch becomes large and dense computation reaches high throughput, the dense path can win.
This is the central mechanism of the paper: activation sparsity creates a compute-saving opportunity, but only a serving-aware implementation can decide when that opportunity is actually cheaper than the overhead needed to exploit it.
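A rough sketch of the two decisions described above, tile rounding and path switching, follows. It is one possible reading rather than the authors' logic: it assumes active neurons are packed and rounded up to whole 64-neuron tiles, and the switch point `max_sparse_batch` is a hypothetical tuning knob that would have to be measured per model and GPU.

```python
import math

TILE = 64  # tile granularity mentioned in the text

def tiles_needed(num_active_neurons, tile=TILE):
    """Whole tiles required once active neurons are packed together."""
    return math.ceil(num_active_neurons / tile)

def sparse_work_fraction(num_active_neurons, d_ff, tile=TILE):
    """Share of the dense up/down work the packed sparse path still performs."""
    return tiles_needed(num_active_neurons, tile) * tile / d_ff

def choose_path(batch_size, max_sparse_batch):
    """Use the sparse kernels only while batches are small enough to benefit."""
    return "sparse" if batch_size <= max_sparse_batch else "dense"

# Illustrative numbers only: 60 active neurons out of a 1024-wide expert.
fraction = sparse_work_fraction(num_active_neurons=60, d_ff=1024)
path_small = choose_path(batch_size=16, max_sparse_batch=128)
path_large = choose_path(batch_size=512, max_sparse_batch=128)
```

Tile rounding means the work saved is always a little less than the raw sparsity suggests, and the dispatch rule is what keeps the sparse path from slowing down large batches.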
Why 90% sparsity becomes 1.2x end-to-end speedup
The speedup numbers are best read in layers.
At the MoE layer level, the sparse execution achieves up to 1.5x to 2.5x speedup at high sparsity levels, especially around the 85% and 87% targets for Qwen3.5-35B-A3B and Qwen3.5-122B-A10B. Peak speedups appear around batch sizes of 16 to 128. At very small batch sizes, fixed overheads such as kernel launch cost, dense gate projection, and activation masking dilute the benefit. At very large batch sizes, dense execution becomes highly efficient and the sparse path loses its advantage.
Hardware matters. The RTX4090, with fewer raw compute and memory resources than the MI355X or H200, shows the highest reported MoE-layer speedup, up to 2.5x. The MI355X reaches up to 1.8x, and the H200 sits between them at up to 2.0x. The larger Qwen3.5-122B-A10B model also sees roughly 30 to 40 percentage points higher speedup across configurations than Qwen3.5-35B-A3B. In plain terms: sparsity helps more when the workload is heavier or the hardware is more constrained.
The accuracy-performance calibration is where the paper becomes sober. For Qwen3.5-35B-A3B on MI355X at batch size 128, the dense baseline has an average benchmark accuracy of 70.9% and MoE-layer time of 0.354 ms. At 85% target activation sparsity, the system achieves 84% total activation sparsity, 94% routed expert sparsity, average accuracy of 67.5%, and time of 0.229 ms, corresponding to 1.55x speedup. That 67.5% score is just above the 95%-of-baseline threshold of about 67.4% (0.95 × 70.9%). At 87% target sparsity, time improves to 0.197 ms and speedup to 1.8x, but accuracy falls to 65.7%, below the 95%-retention rule.
So the practical cutoff is not “maximum sparsity.” It is “maximum sparsity before the quality budget breaks.” This is exactly the kind of boring sentence that saves production teams from exciting incidents.
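The cutoff rule and the speedup arithmetic can be checked directly with the MI355X figures quoted above. The helper `retention_cutoff` is a hypothetical name, and the sketch assumes only the two reported sparsity targets are candidates.

```python
def retention_cutoff(baseline_score, results, retention=0.95):
    """Highest sparsity whose score stays at or above retention * baseline.

    results: iterable of (sparsity, score) pairs.
    Returns None if no candidate passes.
    """
    floor = retention * baseline_score
    passing = [sparsity for sparsity, score in results if score >= floor]
    return max(passing) if passing else None

# Figures quoted in the text: Qwen3.5-35B-A3B, MI355X, batch size 128.
baseline_acc, baseline_ms = 70.9, 0.354
measured = [
    (0.85, 67.5, 0.229),   # (target sparsity, accuracy %, MoE-layer ms)
    (0.87, 65.7, 0.197),
]

cutoff = retention_cutoff(baseline_acc, [(s, acc) for s, acc, _ in measured])
speedups = {s: baseline_ms / ms for s, _, ms in measured}
```

Here `cutoff` comes out at the 85% target, and the faster 87% configuration is rejected on quality grounds even though its layer speedup is higher.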
End-to-end inference is more constrained. With batch-size-based switching enabled, the paper reports up to 1.2x speedup on H200 and up to 1.1x on MI355X, with a minimum speedup of 1.0x across tested configurations. The reason is not mysterious. Dense MoE layer execution accounts for roughly 45% of total execution time in the analyzed setup. Even if that piece speeds up by 1.5x to 1.8x, the rest of the model and the serving engine remain. Attention, embeddings, request scheduling, KV cache management, and output sampling do not vanish because some expert neurons stayed quiet.
Amdahl’s law is not a limitation section. It is the operating manual.
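A back-of-the-envelope check with Amdahl's law shows why the end-to-end number lands where it does. The sketch takes the 45% share and the 1.5x to 1.8x layer speedups from the text and treats everything else as unchanged, which is a simplification.

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when only `fraction` of the runtime gets faster.

    fraction: share of total time spent in the accelerated part (0 to 1).
    local_speedup: speedup of that part alone; float("inf") means it costs nothing.
    """
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

moe_share = 0.45                                   # MoE layers' share of total time
low = amdahl_speedup(moe_share, 1.5)               # about 1.18x
high = amdahl_speedup(moe_share, 1.8)              # 1.25x
ceiling = amdahl_speedup(moe_share, float("inf"))  # about 1.82x if MoE layers were free

# The same formula gives the expert-level cap: gate stays dense, up/down are 2/3.
expert_cap = amdahl_speedup(2.0 / 3.0, float("inf"))
```

These estimates bracket the reported 1.1x to 1.2x once switching overhead and workload mix are considered, and the ceiling shows that even a free MoE layer would not reach 2x end to end in this setup.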
Business value: smaller serving bills, not a free frontier model
For business readers, the paper’s value is not that it makes MoE deployment cheap. It does not. The paper’s value is that it identifies a specific, testable inference optimization for organizations already serving or planning to serve MoE models.
The strongest fit is decode-heavy, latency-sensitive inference where per-step batch sizes often fall in the range where sparse execution beats dense execution. Examples include interactive assistants, long-output generation, coding agents, multi-turn workflow automation, or any serving environment where requests cannot always be packed into large throughput-efficient batches.
The weaker fit is prefill-heavy, high-throughput batch serving where dense kernels are already efficient and the sparse path frequently gets bypassed. In that regime, the system may still avoid losses through dynamic switching, but the upside is naturally smaller.
A practical adoption checklist would look less glamorous than the paper title, which is usually a sign that it might survive contact with reality.
| Business question | What the paper suggests | What must be validated locally |
|---|---|---|
| Can we cut MoE inference cost without retraining? | Possibly, because pretrained MoE models show substantial intra-expert activation sparsity | Whether the target model and workload preserve quality at useful sparsity thresholds |
| Should we sparsify shared experts? | Be careful; shared experts appear more accuracy-sensitive | Whether the architecture has shared experts and how much quality drops under shared sparsity |
| Will 85–90% sparsity cut cost by 85–90%? | No; the reported end-to-end gain is around 1.1x to 1.2x | Full serving-stack profile, not only layer-level timing |
| Which workloads benefit most? | Decode-heavy and latency-bound workloads | Actual request mix, batch-size distribution, token lengths, and hardware |
| Is this mainly a model-science idea or a systems idea? | Both, but business value depends on systems integration | Kernel support, graph compatibility, switching logic, monitoring, and rollback |
This is the right interpretation frame: the paper is about inference-cost engineering. It gives infrastructure teams another lever, not a universal discount code.
Boundaries that matter before anyone files a budget forecast
The first boundary is evaluation scope. The paper uses standard academic benchmarks to define accuracy retention. Those benchmarks are useful for controlled comparison, but enterprise deployments often care about instruction following, factuality under domain documents, tool-use reliability, compliance behavior, code execution success, or workflow completion. A sparsity threshold that preserves average benchmark score may still harm a critical production behavior.
The second boundary is model specificity. The eight tested models provide breadth, and the size trend is important. But a company should not infer that its chosen MoE model has the same cutoff. Architecture details matter: number of experts, active experts, FFN dimension, shared expert design, activation distribution, routing behavior, and serving implementation.
The third boundary is hardware specificity. The paper’s results differ across MI355X, H200, and RTX4090. Sparse execution does not have an abstract speed. It has a speed on a particular GPU, with a particular kernel, batch size, model, and request pattern.
The fourth boundary is maintenance cost. A custom sparse execution path inside a high-performance serving engine is not a one-time patch. It has to keep working as vLLM evolves, kernels change, graph capture assumptions shift, and new model architectures appear. The paper makes this look feasible. It does not make it operationally free.
The final boundary is quality monitoring. Because the sparse path changes computation dynamically, deployment should include quality regression tests at chosen sparsity levels, workload-specific evals, latency and throughput dashboards, and a fast fallback to dense execution. The paper’s dynamic switching by batch size is a performance fallback. Production also needs a quality fallback.
The real lesson: sparsity is a systems property
The clean version of the paper is simple: MoE experts are sparse inside, and skipping inactive neurons can make inference faster.
The useful version is more precise. Existing pretrained MoE models contain substantial intra-expert activation sparsity. The sparsity is stronger in large models. Shared experts are more sensitive than routed experts. Router-weight budgeting gives only modest gains in the tested form. Threshold-based masking can preserve accuracy near the model-specific cutoff. A vLLM implementation can accelerate MoE layers by up to 2.5x, but end-to-end serving gains are closer to 1.1x to 1.2x because only part of the stack is improved.
That may sound less dramatic than the raw 90% sparsity number. Good. Dramatic infrastructure claims usually become quieter after someone reads the profiler.
For Cognaptus readers, the business takeaway is this: the next round of AI efficiency will not come only from smaller models, cheaper GPUs, or more aggressive quantization. It will also come from finding unused computation inside the architectures companies already run, then asking the unfashionable question: can the serving system actually skip it?
In this paper, the answer is yes, but only where the mechanism, model, workload, and hardware line up. That is not a disappointment. That is what real optimization looks like.
Cognaptus: Automate the Present, Incubate the Future.
[1] Jongseok Park, Sunga Kim, Zhenyu Gu, Ion Stoica, and Alvin Cheung, “Uncovering Intra-expert Activation Sparsity for Efficient Mixture-of-Expert Model Execution,” arXiv:2605.08575, 2026, https://arxiv.org/abs/2605.08575.