MoE Money, MoE Problems: Expert Capacity Finally Gets a Manager

TL;DR for operators

Mixture-of-Experts models are supposed to give businesses the best of both worlds: lots of parameters for capability, few active parameters for cost. Lovely on the slide. Messier in the server room.

Two recent papers make the same larger point from opposite sides of the MoE machinery. SoftMoE attacks the compute-allocation problem: why should every token, in every layer, use the same fixed number of experts just because the architecture designer had to choose a value for top-$k$?¹ Tied Expert Layers attacks the memory problem: why should every layer store its own expert FFNs when many of those expert weights may be redundant across nearby layers?²

The combined lesson is not “MoE is solved.” It is more useful than that. The lesson is that MoE efficiency is becoming a resource governance problem inside the model. Expert capacity has to be allocated, reused, protected, and reinvested across depth.

For business and platform teams, this changes the due-diligence question. Do not ask only: “How many experts does the model have?” That is the architectural equivalent of asking how many desks are in an office and calling it a productivity audit. Ask instead:

| Operator question | Why it matters |
| --- | --- |
| How many experts are active per token, per layer? | Determines compute cost and latency. |
| Are active experts fixed or adaptive? | Determines whether compute is wasted on easy tokens and underused on hard ones. |
| How many unique expert parameters must be stored? | Determines memory footprint, optimizer state, and deployment friction. |
| Which layers actually need unique expert capacity? | Determines where depth-aware allocation or tying may help. |
| Are saved parameters removed or reinvested? | Determines whether the gain becomes lower cost or better quality at the same budget. |

The annoying but important conclusion: MoE efficiency is no longer just sparsity. It is architecture-level budgeting.

The MoE bargain is starting to show its accounting problem

MoE architectures became attractive because they decouple total model capacity from per-token computation. A dense model uses the same large block of parameters for every token. An MoE model keeps many expert FFNs available but activates only a subset for each token. In principle, this lets a model carry more representational capacity without paying the full compute bill on every forward pass.
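To put numbers on that decoupling, here is a minimal sketch of the arithmetic. The `MoEConfig` class and every value in it are hypothetical illustration figures, not taken from either paper, and the count covers expert FFN parameters only (attention and embeddings are ignored).

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    # All fields are hypothetical illustration values, not from any real model.
    num_layers: int
    experts_per_layer: int
    params_per_expert: int
    top_k: int  # experts activated per token in each layer


def stored_expert_params(cfg: MoEConfig) -> int:
    """Expert parameters that must exist in memory."""
    return cfg.num_layers * cfg.experts_per_layer * cfg.params_per_expert


def active_expert_params(cfg: MoEConfig) -> int:
    """Expert parameters one token actually touches in a forward pass."""
    return cfg.num_layers * cfg.top_k * cfg.params_per_expert


cfg = MoEConfig(num_layers=24, experts_per_layer=8, params_per_expert=50_000_000, top_k=2)
stored = stored_expert_params(cfg)
active = active_expert_params(cfg)
print(f"stored: {stored / 1e9:.1f}B, active per token: {active / 1e9:.1f}B, "
      f"ratio: {active / stored:.2f}")
```

With these made-up numbers, each token touches a quarter of the expert parameters while the system stores all of them. That gap is the whole bargain, and also the whole problem.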

That bargain has a catch. Actually, two.

First, the compute side can be rigid. Standard sparse MoE routing often uses hard top-$k$: each token is sent to exactly $k$ experts. The model may learn which experts to use, but the number of active experts is set in advance. Easy token? Hard token? Early layer? Late layer? Congratulations, everyone gets the same architectural ration card.
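The ration card is easy to see in code. The following is a minimal sketch of generic hard top-$k$ gating in plain Python; the function names and the two example tokens are illustrative assumptions, not any specific library's router.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def hard_top_k(router_logits: list[float], k: int) -> dict[int, float]:
    """Return {expert_index: gate_weight} for exactly k experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize gates over the chosen experts
    return {i: probs[i] / norm for i in top}


easy_token = [4.0, 0.1, 0.0, -0.2]  # hypothetical: one expert clearly dominates
hard_token = [1.0, 0.9, 0.8, 0.7]   # hypothetical: four experts nearly tied

for name, logits in (("easy", easy_token), ("hard", hard_token)):
    gates = hard_top_k(logits, k=2)
    print(name, {i: round(g, 3) for i, g in gates.items()})
```

Both tokens get exactly two experts. The easy token barely uses its second one, and the hard token is denied a third and fourth that score almost as well.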

Second, the memory side can be bloated. Even if only a few experts are active for a given token, the system must still store the full expert pool. During training, optimizer states and gradient communication make this even less amusing. During inference, memory residency and bandwidth become a serious deployment constraint. Sparse compute is nice; idle parameters still pay rent.

The two papers fit together because each attacks one of these two waste channels.

SoftMoE says: make active expert use adaptive under a compute budget.

Tied Expert Layers says: make expert parameters reusable under a memory budget.

Together, they point toward a more mature view of MoE systems: not as collections of more and more experts, but as layered resource markets where compute, memory, routing, attention, and parameter reuse each need a budget.

The logic chain: from expert count to expert governance

The combined argument is best read as a chain, not as two separate technique summaries.

1. MoE scaling separates total parameters from active compute.
2. That separation creates two forms of waste: rigid activation and idle memory.
3. SoftMoE reduces rigid activation by learning how many experts to activate across tokens and layers.
4. Tied Expert Layers reduces idle memory by sharing expert FFN weights across nearby layers.
5. Together, they imply that future MoE design is about managing expert capacity across depth, not merely expanding the expert pool.

That final step is the business-relevant one. The practical value is not that every company should immediately rebuild its stack around these exact methods. The practical value is that these papers make the cost structure visible.

A useful mental model:

$$ \text{MoE value} \neq \text{total experts} $$

A better one:

$$ \text{MoE value} \approx \frac{\text{quality retained or gained}} {\text{active compute} + \text{unique memory} + \text{communication overhead} + \text{operational complexity}} $$

Yes, the denominator is less glamorous. Unfortunately, invoices also live in the denominator.

What SoftMoE shows: active compute should not be a fixed ritual

SoftMoE starts with a familiar MoE weakness: hard top-$k$ expert selection is discrete and therefore awkward for gradient-based optimization. The model can learn routing scores, but the hard selection itself blocks direct differentiation through the choice of active experts. It also fixes the active expert count in advance.

The paper replaces hard top-$k$ routing with a differentiable relaxation based on a LapSum soft top-$k$ operator. In practical terms, the router produces soft expert selection weights, then low-contribution weights are truncated so the computation remains sparse. This gives the model a route to adaptive behavior: different tokens can activate different numbers of experts, and the mean number of active experts can be learned across layers.
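To make "soft weights, then truncation" concrete, here is a minimal sketch. It does not reproduce the paper's LapSum operator. It swaps in a simpler, generic relaxation (a sigmoid around a threshold, with the threshold found by bisection so the weights sum to $k$), and the `temperature` and `eps` values are arbitrary assumptions chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable form: the exponent is never positive.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_top_k(scores: list[float], k: float, temperature: float = 0.1,
               iters: int = 60) -> list[float]:
    """Weights in (0, 1) summing to ~k: sigmoid((score - tau) / temperature)."""
    lo, hi = min(scores) - 10.0, max(scores) + 10.0
    for _ in range(iters):
        tau = (lo + hi) / 2
        total = sum(sigmoid((s - tau) / temperature) for s in scores)
        if total > k:
            lo = tau  # threshold too low: too much weight selected
        else:
            hi = tau
    tau = (lo + hi) / 2
    return [sigmoid((s - tau) / temperature) for s in scores]


def truncate(weights: list[float], eps: float = 0.05) -> dict[int, float]:
    """Drop low-contribution experts so the computation stays sparse."""
    return {i: w for i, w in enumerate(weights) if w >= eps}


clear_token = [3.0, 2.0, -1.0, -2.0]  # hypothetical: two obvious winners
tied_token = [1.0, 0.95, 0.9, -2.0]   # hypothetical: three experts nearly tied

for name, scores in (("clear", clear_token), ("tied", tied_token)):
    active = truncate(soft_top_k(scores, k=2))
    print(name, len(active), {i: round(w, 2) for i, w in active.items()})
```

Under these assumptions the same soft budget of two yields two active experts for the clear-cut token and three for the ambiguous one. That is the adaptive behavior the paragraph describes, in toy form.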

The key budget idea is simple:

$$ \sum_{\ell=1}^{L} k_\ell \le B,\quad k_\ell \ge 1 $$

Here, $k_\ell$ is the expected number of active experts in layer $\ell$, and $B$ is the global active-expert budget. The point is not to let the model use everything. That would be dense computation wearing a sparse-model costume. The point is to force layers to compete for expert capacity.
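One simple way to see the constraint at work is to take per-layer "wish list" values and squeeze them into the budget. The sketch below is an assumption for illustration only: the paper learns its allocation during training, whereas `fit_to_budget` is a hypothetical after-the-fact rescaling that keeps every layer at one expert or more and shrinks only the excess.

```python
def fit_to_budget(desired: list[float], budget: float) -> list[float]:
    """Return k_l >= 1 per layer with sum(k_l) <= budget."""
    n = len(desired)
    if budget < n:
        raise ValueError("budget must allow at least one expert per layer")
    clamped = [max(1.0, k) for k in desired]
    if sum(clamped) <= budget:
        return clamped
    # Keep the guaranteed 1 per layer and scale down only what sits above it.
    excess = [k - 1.0 for k in clamped]
    scale = (budget - n) / sum(excess)
    return [1.0 + e * scale for e in excess]


# Hypothetical wish list for a 6-layer model: later layers ask for more.
desired = [0.5, 1.0, 2.0, 4.0, 6.0, 8.0]
allocation = fit_to_budget(desired, budget=12)
print([round(k, 3) for k in allocation], "sum =", round(sum(allocation), 3))
```

Here the last layer asked for eight experts and gets 3.625, because the other layers also have a claim on the same twelve. Competition, in other words.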

When SoftMoE is allowed to learn this allocation, the paper reports a strongly non-uniform pattern: later layers receive more active expert budget, with the top three transformer layers absorbing roughly half of the total budget in the reported experiments. The authors interpret this as evidence that later stages of token processing may benefit more from conditional capacity than earlier or middle layers.

This matters because many MoE systems still treat expert count per layer as a uniform architectural constant. SoftMoE’s result suggests that uniformity may be a convenience, not an optimum.

The operator translation

SoftMoE is not merely a routing trick. It is a budgeting mechanism.

| Design choice | Conventional sparse MoE | SoftMoE interpretation |
| --- | --- | --- |
| Active experts per token | Fixed by top-$k$ | Can vary by token after truncation |
| Expert budget across layers | Usually uniform | Learnable under a global constraint |
| Optimization signal | Indirect around hard selection | Differentiable through soft selection and allocation |
| Business implication | Compute plan is static | Compute can be allocated where it creates more value |

The business question becomes: Are you paying the same expert-compute tax for every token and every layer because the model needs it, or because the architecture never learned how not to?

SoftMoE’s answer is that some of that compute can be reallocated. The paper reports comparable or better language-modeling and downstream results against sparse MoE baselines while activating fewer experts on average in many configurations. It also notes that learned allocation improves language-modeling performance but can activate more experts on average, reflecting a quality-efficiency trade-off rather than free magic. Sensible. Deep learning remains rude that way.

The limitation is also important. The experiments use English corpora and English downstream tasks, and the architecture scale is 1.63B parameters. That is substantial research scale, not frontier deployment proof. The right takeaway is not “replace your router on Monday.” It is “fixed top-$k$ may be an under-managed compute budget.”

What Tied Expert Layers shows: idle expert memory is not sacred

The second paper approaches the same MoE efficiency problem from the opposite side. It asks: even if MoE saves compute by activating only a few experts, why must every layer store its own separate expert FFN weights?

Tied Expert Layers proposes expert tying: share expert FFN weights across consecutive transformer layers while keeping attention, routing, and normalization layer-specific. In other words, the expert pool can be reused across a small depth window, but the surrounding layer machinery still changes the hidden state and routing context.

This is a subtle but important distinction. The method is not fully looping the entire transformer block. It is sharing the expensive MoE FFN sub-block while preserving per-layer attention. The paper’s component ablations are the interesting part: they find that per-layer attention, not routing alone, is the main mechanism that keeps tied layers distinct in terms of loss.
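The sharing pattern can be sketched as a simple layer-to-pool map. This is an illustrative layout under stated assumptions, not the paper's code: `expert_pool_ids`, the group size of two, and the choice to leave the first and last layers untied by default are all hypothetical parameters.

```python
def expert_pool_ids(num_layers: int, group_size: int,
                    untie_ends: bool = True) -> list[int]:
    """Map each layer index to the id of the expert pool it reads from."""
    ids: list[int] = []
    next_id = 0
    run = 0  # how many layers already share the currently open pool
    for layer in range(num_layers):
        if untie_ends and layer in (0, num_layers - 1):
            if run > 0:  # close a partially filled group before an end layer
                next_id += 1
                run = 0
            ids.append(next_id)
            next_id += 1
        else:
            ids.append(next_id)
            run += 1
            if run == group_size:
                next_id += 1
                run = 0
    return ids


pools = expert_pool_ids(num_layers=8, group_size=2)
for layer, pool in enumerate(pools):
    # Attention, router, and normalization stay per layer; only experts are shared.
    print(f"layer {layer}: attention[{layer}] router[{layer}] "
          f"norm[{layer}] experts[pool {pool}]")
print(f"unique expert pools: {len(set(pools))} of {len(pools)} layers")
```

In this toy layout eight layers read from five expert pools, while every layer keeps its own attention, router, and normalization.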

That creates a useful tension with SoftMoE.

SoftMoE treats routing flexibility as the core lever for compute allocation. Tied Expert Layers finds that router diversity is visible, but attention carries more of the burden for maintaining useful layer distinctness when expert weights are shared. These are not contradictions. They are different answers to different resource questions.

SoftMoE asks:

How many experts should be active, and where?

Tied Expert Layers asks:

How many unique expert weights do we actually need to store?

The answers can coexist. In fact, they probably have to.

The operator translation

Tied Expert Layers reframes expert parameters as reusable infrastructure rather than layer-local property.

| Design choice | Standard MoE | Tied Expert Layers |
| --- | --- | --- |
| Expert FFN weights | Unique per layer | Shared across consecutive layers |
| Attention | Unique per layer | Kept unique per layer |
| Router | Unique per layer | Kept layer-specific in the main recipe, with router health monitored |
| Memory footprint | Full expert pool stored across all layers | Reduced unique expert parameters |
| Saved budget | None to allocate | Banked as compression, or reinvested to widen tied expert pools at a similar parameter budget |

The paper reports that tying experts in groups can preserve quality with limited degradation, and that at 7B scale expert tying can match the untied baseline while roughly halving the model parameter footprint. It also reports that reinvesting saved parameters into wider tied middle-layer expert pools can outperform the untied baseline at comparable parameter count.

That last point is where this becomes more than compression. Compression says: “Can we make the model smaller without hurting it too much?” Reinvestment says: “Can we spend the same parameter budget differently and get a better architecture?”

That is the more interesting business idea. A saved parameter is not automatically a cost saving. It is an option. You can bank it as lower memory cost, or you can reinvest it as wider expert capacity.
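The bank-or-reinvest choice is simple arithmetic. The sketch below counts experts as a stand-in for parameters and treats "wider" as more experts per tied pool. Both are assumptions for illustration, as are the layer counts; the paper's own widening scheme may differ.

```python
def tying_budget(num_layers: int, experts_per_layer: int, group_size: int) -> dict:
    """Count experts as a proxy for parameters, assuming equal-sized experts."""
    if num_layers < 3:
        raise ValueError("need at least one middle layer to tie")
    baseline = num_layers * experts_per_layer
    end_experts = 2 * experts_per_layer           # first and last layers stay untied
    middle_layers = num_layers - 2
    tied_pools = -(-middle_layers // group_size)  # ceiling division
    stored_after_tying = end_experts + tied_pools * experts_per_layer
    return {
        "baseline_experts": baseline,
        # Option 1: bank the savings as a smaller model.
        "banked_experts": baseline - stored_after_tying,
        # Option 2: spend the original budget on wider tied pools instead.
        "reinvested_pool_width": (baseline - end_experts) // tied_pools,
    }


plan = tying_budget(num_layers=8, experts_per_layer=8, group_size=2)
print(plan)
```

With these hypothetical numbers, tying frees 24 of 64 experts. The operator can keep the smaller model, or double each tied pool from 8 to 16 experts and land back at the original budget.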

This is how engineering becomes finance, except with fewer suits and more CUDA errors.

The important tension: routing matters, but not always in the same way

The two papers create a useful tension around routing.

SoftMoE says routing needs to be more flexible. Hard top-$k$ fixes the active expert count and prevents learned expert budgets across depth. Its contribution is to make expert activation differentiable enough to learn more adaptive allocation patterns.

Tied Expert Layers says routing diversity is not the primary reason tied layers stay capable. In its component ablation, freeing routers changes cross-loop routing agreement, but per-layer attention has the larger effect on validation loss. The paper’s conclusion is not “routers do not matter.” It is more precise: when expert FFNs are shared across nearby layers, attention is the more important layer-distinguishing component.

For business readers, this is the difference between allocation and identity.

SoftMoE uses routing to decide where active compute goes. Tied Expert Layers studies what makes one layer operationally distinct from another when some parameters are shared. Routing can matter a great deal for the first problem while attention carries more of the second.

That distinction prevents a common misunderstanding: “The router is the MoE brain.” Sometimes. Other times it is the traffic cop. And occasionally the road design matters more than the cop’s personality.

A framework for evaluating MoE efficiency

The combined papers suggest a practical four-part framework for evaluating MoE designs.

1. Active compute: what is actually used?

Headline parameter count is often misleading. In MoE systems, the deployed cost is shaped by how many experts are active per token, how often, and in which layers.

SoftMoE shows that active expert count need not be uniform. Tokens and layers can receive different amounts of expert compute under a global budget. For operators, this means the relevant metric is not simply top-$k$, but the distribution of active experts across traffic.

Questions to ask:

Is active expert count fixed or adaptive?
Is the active budget layer-uniform or learned?
Does the model expose active-expert statistics during evaluation?
Does the active compute distribution shift by task, prompt type, or domain?

A model that averages low active compute but spikes unpredictably may have different serving economics than one with a stable profile. Mean cost is not a capacity plan.
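A small sketch shows why the mean alone misleads. The two traffic profiles below are invented for illustration, and `active_expert_profile` is a hypothetical helper that uses the nearest-rank percentile method.

```python
import statistics


def active_expert_profile(counts: list[int], percentile: int = 99) -> dict:
    """Summarize per-token active-expert counts: mean, tail percentile, max."""
    ordered = sorted(counts)
    rank = -(-percentile * len(ordered) // 100)  # ceil(p * n / 100), nearest rank
    return {
        "mean": statistics.fmean(ordered),
        f"p{percentile}": ordered[max(rank, 1) - 1],
        "max": ordered[-1],
    }


# Hypothetical per-token active-expert counts for two models.
stable = [2] * 100              # every token uses two experts
spiky = [1] * 90 + [11] * 10    # mostly one expert, occasionally eleven

print("stable:", active_expert_profile(stable))
print("spiky: ", active_expert_profile(spiky))
```

Both profiles average two active experts per token, but the spiky one needs headroom for eleven. Capacity gets provisioned against the tail, not the mean.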

2. Unique memory: what must be stored?

Tied Expert Layers reminds us that sparse activation does not remove the need to store the full expert pool. This matters in training because optimizer states amplify memory demands. It matters in inference because memory bandwidth and residency constraints often dominate deployment feasibility.

Questions to ask:

How many unique expert parameters must be resident?
Are expert weights unique per layer, shared, cached, paged, or otherwise reused?
What is the optimizer-state cost during training or fine-tuning?
Does parameter sharing reduce communication under distributed training?

If active compute is the restaurant bill, unique memory is the rent. Both matter. Only one tends to look impressive in model announcements.
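The rent can be estimated on the back of an envelope. In the sketch below the parameter count is hypothetical, and the bytes-per-parameter figures are stated assumptions: 2 bytes for 16-bit inference weights, and 16 bytes for training with fp32 weights, fp32 gradients, and two fp32 Adam moments. Real setups with mixed precision, sharding, or quantization will differ.

```python
GIB = 2**30


def resident_gib(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to keep num_params resident, in GiB."""
    return num_params * bytes_per_param / GIB


stored_expert_params = 9_600_000_000  # hypothetical expert pool size

inference_gib = resident_gib(stored_expert_params, 2)    # 16-bit weights only
training_gib = resident_gib(stored_expert_params, 16)    # weights + grads + 2 Adam moments
tied_training_gib = resident_gib(stored_expert_params // 2, 16)  # if tying halved the pool

print(f"inference: {inference_gib:.1f} GiB")
print(f"training:  {training_gib:.1f} GiB")
print(f"training with half the unique experts: {tied_training_gib:.1f} GiB")
```

None of these figures depends on how many experts are active per token. Sparsity trims the bill but leaves the rent exactly where it was.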

3. Depth distinctness: what must remain layer-specific?

Not every component should be shared. Tied Expert Layers argues that first and last layers should remain untied, and that per-layer attention is important for preserving layer distinctness when expert FFNs are shared. This creates a more nuanced design rule: share the largest redundant component, protect the components that preserve function.

Questions to ask:

Are first and last layers treated differently from the middle stack?
Which components are shared: expert FFNs, routers, attention, normalization, or full blocks?
Is there ablation evidence showing what sharing costs?
Does the model preserve router health and expert utilization under tying?

The crude version of parameter sharing is “reuse everything.” The better version is “reuse what is redundant and protect what differentiates.” One of these is architecture design. The other is a spreadsheet having an episode.
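The router-health question can be made measurable with a few generic statistics. The sketch below is not the paper's monitoring setup: `load_stats` is a hypothetical helper, the assignment lists are invented, and the three metrics are common, simple choices for spotting expert collapse.

```python
import math
from collections import Counter


def load_stats(assignments: list[int], num_experts: int) -> dict:
    """Summarize how evenly tokens are spread over experts in one layer."""
    counts = Counter(assignments)
    total = len(assignments)
    shares = [counts.get(e, 0) / total for e in range(num_experts)]
    entropy = -sum(p * math.log(p) for p in shares if p > 0)
    return {
        "normalized_entropy": entropy / math.log(num_experts),  # 1.0 = perfectly even
        "max_over_mean_load": max(shares) * num_experts,        # 1.0 = perfectly even
        "unused_experts": sum(1 for p in shares if p == 0),
    }


# Hypothetical token-to-expert assignments for one layer with four experts.
balanced = [0, 1, 2, 3] * 25
collapsed = [0] * 98 + [1] * 2

print("balanced: ", load_stats(balanced, num_experts=4))
print("collapsed:", load_stats(collapsed, num_experts=4))
```

A tied design that keeps these numbers close to the balanced case in every layer is reusing experts. One that drifts toward the collapsed case is just storing them.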

4. Budget reinvestment: where do savings go?

Both papers imply that efficiency gains create choices.

SoftMoE can use fewer experts for similar performance in some settings, or use learned allocation to improve performance with a different expert-use profile. Tied Expert Layers can reduce memory footprint, or reinvest saved parameters into wider expert pools. The right business decision depends on the constraint.

| Constraint | Likely priority |
| --- | --- |
| Serving latency | Reduce active experts or stabilize active-compute variance |
| GPU memory | Reduce unique expert parameters through tying or compression |
| Training communication | Reduce unique tensors and optimizer-state movement |
| Quality at fixed budget | Reinvest saved parameters into width or targeted capacity |
| Edge deployment | Combine low active compute with lower unique memory |

The managerial mistake is to treat “efficiency” as one number. It is not. Efficiency is a portfolio of trade-offs.

What this means for AI buyers and platform teams

Most businesses will not implement SoftMoE or tied expert layers directly. That is not the point. The point is that these papers provide better questions for procurement, vendor evaluation, internal architecture review, and deployment planning.

When evaluating MoE-based models or model-serving claims, ask for evidence along five dimensions:

| Evidence category | What to request |
| --- | --- |
| Active expert profile | Average and tail active experts per token, broken down by layer |
| Memory footprint | Unique parameter count, optimizer-state cost, and inference residency assumptions |
| Routing stability | Load balance, entropy, expert collapse checks, and task-level variation |
| Layer strategy | Whether expert capacity is uniform, learned, tied, or widened in specific regions |
| Cost-quality trade-off | Quality at matched active compute and matched total memory, not just headline benchmarks |

The last one is essential. A model can look efficient because it uses less compute but silently consumes more memory. Another can look compact because it stores fewer parameters but loses quality on tasks that matter. Another can win benchmarks while producing an operational profile that makes the infrastructure team stare into the middle distance.

The useful comparison is not “Which model has more experts?” It is:

$$ \text{Quality per operational dollar at your workload mix} $$

Academic papers rarely phrase it that way, because reviewers are not billed by your cloud provider. You are.
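As a rough sketch of that comparison, the snippet below weights quality and cost by a workload mix. Every number is invented for illustration, the task names and the `quality_per_dollar` helper are hypothetical, and a real evaluation would use measured quality scores and billed costs.

```python
def quality_per_dollar(workload_mix: dict[str, float],
                       quality: dict[str, float],
                       cost: dict[str, float]) -> float:
    """Workload-weighted quality divided by workload-weighted cost."""
    total = sum(workload_mix.values())
    avg_quality = sum(w * quality[task] for task, w in workload_mix.items()) / total
    avg_cost = sum(w * cost[task] for task, w in workload_mix.items()) / total
    return avg_quality / avg_cost


# Hypothetical workload: 70% chat traffic, 30% code traffic.
mix = {"chat": 70, "code": 30}

# Hypothetical per-task quality scores (0 to 1) and cost in dollars per 1k requests.
model_a = {"quality": {"chat": 0.80, "code": 0.60}, "cost": {"chat": 1.0, "code": 2.0}}
model_b = {"quality": {"chat": 0.82, "code": 0.70}, "cost": {"chat": 2.0, "code": 3.0}}

score_a = quality_per_dollar(mix, model_a["quality"], model_a["cost"])
score_b = quality_per_dollar(mix, model_b["quality"], model_b["cost"])
print(f"model A: {score_a:.3f}  model B: {score_b:.3f}")
```

In this made-up case model B wins every quality column and still loses on quality per dollar. Change the mix or the prices and the ranking can flip, which is why the workload mix belongs in the comparison.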

What not to overread

There are limits.

SoftMoE is tested at 1.63B parameters, on English language-modeling corpora and English downstream tasks. Its learned later-layer allocation is interesting, but it should not be treated as a universal law of all MoE models, modalities, or training regimes. It is a strong signal, not scripture.

Tied Expert Layers reaches 7B scale and reports promising quality and throughput results, but the authors explicitly note that three things remain unproven: behavior at frontier scale, behavior under longer-horizon training, and whether width expansion wins uniformly. Its implementation also does not use tied-layer-aware optimized kernels, which means reported efficiency gains may not reflect the final engineering ceiling.

The combined conclusion is therefore measured:

The papers do not prove one final MoE architecture.
They do not eliminate routing, memory, or communication headaches.
They do not say more experts are useless.
They do show that where and how expert capacity is used may matter as much as how much expert capacity exists.

That is enough to change the conversation.

The strategic takeaway: MoE is becoming an operating system for capacity

MoE began as an appealing scaling trick: activate a few experts, store many, get capability without dense compute. The next phase looks less like a trick and more like resource scheduling.

SoftMoE turns active expert computation into something that can be learned across tokens and layers. Tied Expert Layers turns expert parameters into something that can be reused across depth without fully collapsing layer identity. One manages the compute budget. The other manages the memory budget. Together, they point toward a new design principle:

Expert capacity should be treated as a depth-aware resource, not a uniform architectural decoration.

For operators, this means MoE model evaluation should move past the easy numbers. Total parameters, active parameters, and expert count are not irrelevant, but they are insufficient. The more useful questions are about allocation, reuse, bottlenecks, and reinvestment.

The market will probably still advertise giant expert counts, because large numbers enjoy excellent public relations. But inside serious AI infrastructure work, the question is shifting.

Not “How many experts do we have?”

More like:

“Which experts are worth paying for, where, and how many times can we reuse them before the model notices?”

Finally, expert management. Middle management had to find a purpose eventually.

Cognaptus: Automate the Present, Incubate the Future.

¹ Mikołaj Zasada, Łukasz Struski, Jacek Tabor, and Marcin Kurdziel, “SoftMoE: Soft Differentiable Routing for Mixture-of-Experts in LLMs,” arXiv:2606.17952, 2026. https://arxiv.org/abs/2606.17952
² Martin Jaggi, “Tying the Loop — Tied Expert Layers in Mixture-of-Experts Language Models,” arXiv:2606.16825, 2026. https://arxiv.org/abs/2606.16825
