Phones have memory. They also have batteries, thermal limits, app sandboxes, operating-system overhead, impatient users, and the charming habit of becoming hand warmers when developers pretend they are cloud GPUs with a smaller logo.
That is the business problem behind MobileMoE, a paper that studies whether Mixture-of-Experts language models can work in the sub-billion-active-parameter regime for on-device deployment.[1] The usual MoE story belongs to giant models: add many experts, activate a few, keep per-token compute low, and let the cloud hardware worry about the rest. MobileMoE asks a less fashionable but more commercially useful question: can the same sparse principle survive inside the memory and latency budget of a smartphone?
The answer is not simply “yes, MoE is efficient.” That would be too convenient, and therefore suspicious. The paper’s more useful answer is narrower: on-device MoE works when architecture, training recipe, quantization, and runtime kernel design are all optimized around the same physical constraint. The model cannot merely be sparse on a FLOP spreadsheet. It has to be sparse in a way the phone can actually execute.
That is why this paper should be read mechanism-first, not benchmark-first. The benchmark tables matter, but they are downstream evidence. The real contribution is the chain: mobile memory and compute constraints define a scaling problem; the scaling law chooses moderate sparsity, fine-grained experts, and a shared expert; training and QAT preserve capability; and a custom fused MoE operator turns the theoretical savings into measured smartphone latency.
The real bottleneck is not “small model quality”; it is memory-qualified compute
Most discussions of on-device LLMs still sound like a dense-model diet plan: shrink the model, quantize the weights, accept some quality loss, and hope users do not notice. MobileMoE starts from a different premise. On a device, active compute and total memory are not the same resource.
Dense models tie them together. More parameters usually mean more active computation per token. MoE breaks that coupling: total parameters can rise because many experts exist, while active parameters stay lower because only a subset is routed for each token. This is exactly why MoE became attractive for frontier-scale models. But the phone version has a trap: total parameters still have to live somewhere. A model that is sparse in compute but too large in weight memory is not “efficient”; it is just unemployed.
MobileMoE therefore formulates an on-device MoE scaling law that jointly considers active parameters, training data, expert count, expert granularity, shared expert design, per-token inference compute, and device memory. The paper uses a practical memory proxy that includes quantized model weights and KV cache, with 4-bit weights and 8-bit KV cache as the assumed deployment-oriented regime. The authors treat roughly 5 GB as a practical upper budget for current smartphone app usage, not as a universal law of nature.
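The memory proxy described above can be approximated with a few lines of arithmetic. The sketch below is illustrative only: the function name, the layer count, the KV-head count, and the head dimension are hypothetical values chosen for the example, not figures from the paper. Only the 4-bit weights, 8-bit KV cache, 8K context, and roughly 5 GB budget come from the text.

```python
def memory_proxy_gb(total_params, n_layers, n_kv_heads, head_dim, context_len,
                    weight_bits=4, kv_bits=8):
    """Rough memory proxy in GB (1e9 bytes): quantized weights plus KV cache.

    Ignores quantization scales, activations, and runtime overhead, so it is
    a lower-level proxy and not a prediction of peak RSS.
    """
    weight_bytes = total_params * weight_bits / 8
    # Keys and values: 2 tensors per layer, one head_dim vector per KV head per token.
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bits / 8
    return (weight_bytes + kv_bytes) / 1e9


BUDGET_GB = 5.0  # the practical upper budget the article cites

# Hypothetical attention shape; only total_params and context_len echo the article.
proxy = memory_proxy_gb(total_params=5.3e9, n_layers=24, n_kv_heads=4,
                        head_dim=128, context_len=8192)
print(f"proxy = {proxy:.2f} GB, fits budget: {proxy < BUDGET_GB}")
```

The useful property of a proxy like this is that total parameters, not active parameters, drive the weight term, which is exactly why a compute-sparse model can still fail the memory test.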
The important move is conceptual. The paper does not ask, “How much MoE can we add before the benchmark improves?” It asks, “Which MoE shape gives lower loss while staying inside the device’s memory and compute envelope?” That is a much better business question. It starts from deployment constraints, not leaderboard aesthetics.
The architecture result is moderate sparsity, not expert maximalism
The first likely misconception is that MoE means more experts are always better. This is the familiar buffet mistake: because choice is good, infinite choice must be excellent. Hardware, sadly, has not read that memo.
MobileMoE studies three design axes: the number of experts $E$, expert granularity $g$, and whether to include a shared expert. The paper’s scaling-law ablations are not side decorations; they are the mechanism that justifies the final architecture.
| Design choice | Likely purpose of the test | What the paper finds | What it means operationally |
|---|---|---|---|
| Number of experts $E$ | Main scaling-law ablation | MoE can beat dense models at fixed memory and compute, but returns diminish; $E=8$ is the practical sweet spot under the on-device memory regime. | Sparse capacity helps, but excessive total parameters become memory baggage. |
| Expert granularity $g$ | Architecture ablation | Fine-grained experts improve loss at fixed compute, with diminishing returns beyond $g=8$. | Smaller expert slices give the router more compositional flexibility without increasing total or active parameters. |
| Shared expert | Architecture ablation | Adding an always-on shared expert improves loss at fixed compute. | A generalist path complements routed specialists, reducing the risk that every token must be handled only by sparse specialists. |
| Training-efficiency curves | Robustness and implementation sanity check | $E=8$, $g=8$, and shared expert are also efficient in wall-clock training behavior; $g=16$ adds overhead with little loss reduction. | The chosen design is not just curve-fit elegant; it is trainable on real hardware without becoming a science-fair furnace. |
The final MobileMoE configuration uses moderate sparsity, fine-grained experts, top-$k$ routing, and a shared expert. Across the S/M/L family, the models have about 272M, 528M, and 922M active parameters, while total parameters rise to about 1.3B, 2.8B, and 5.3B. This is the key separation: active compute remains sub-billion scale, while total capacity is larger.
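The separation between active and total parameters can be made concrete with a parameter count for one MoE feed-forward layer. This is a minimal sketch under assumptions the article does not confirm: it uses a common fine-grained MoE convention in which granularity $g$ splits each of $E$ experts into $g$ slices of hidden width `d_ff / g` and routes each token to $g$ slices, it assumes a three-matrix SwiGLU expert, and `shared_hidden` is a hypothetical knob for the shared expert's width. The paper's exact parameterization may differ.

```python
def moe_ffn_params(d_model, d_ff, E, g, shared_hidden=0):
    """Count parameters of one MoE feed-forward layer (assumed convention).

    E: expert count at granularity 1; g: granularity (slices per expert).
    """
    assert d_ff % g == 0, "d_ff must be divisible by the granularity"
    n_routed = E * g                 # number of fine-grained routed experts
    top_k = g                        # slices activated per token (assumption)
    expert_params = 3 * d_model * (d_ff // g)   # SwiGLU: gate, up, down
    shared_params = 3 * d_model * shared_hidden  # always-on shared expert
    router_params = d_model * n_routed
    return {
        "n_routed": n_routed,
        "top_k": top_k,
        "routed_total": n_routed * expert_params,
        "routed_active": top_k * expert_params,
        "shared": shared_params,
        "router": router_params,
    }


coarse = moe_ffn_params(d_model=1024, d_ff=4096, E=8, g=1)
fine = moe_ffn_params(d_model=1024, d_ff=4096, E=8, g=8)
# Under this convention, finer granularity leaves expert weights unchanged;
# only the (small) router grows.
print(coarse["routed_total"] == fine["routed_total"],
      coarse["routed_active"] == fine["routed_active"],
      fine["router"] - coarse["router"])
```

Under these assumptions the routed total is always $E$ times the routed active count, which is the ratio the scaling-law ablations are trading against device memory.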
For business readers, the translation is simple: MobileMoE is not selling “more model” or “less model.” It is selling a more useful ratio between capability-bearing memory and per-token work. That ratio matters when inference happens on a phone, a wearable, a vehicle computer, or a robot where every extra watt has somewhere better to be.
The training recipe is where sparse architecture becomes a usable model
Architecture alone does not make a deployable model. It makes a nice diagram. MobileMoE pairs the architecture with a four-stage recipe: pre-training, mid-training, supervised fine-tuning, and INT4 quantization-aware training.
The pre-training stage uses 6T tokens at 2K context, with a web-heavy but domain-diverse mix covering math, code, knowledge, and science. The paper argues that this diversity supports expert specialization. That is plausible and supported by the later expert-utilization analysis, where different downstream domains activate different expert subsets.
Mid-training extends context length to 8K and shifts the data mixture toward higher-quality domain-specific sources. This is not a ceremonial second pass. In the paper’s stage analysis, mid-training is where knowledge and reading comprehension receive large gains. Supervised fine-tuning then improves instruction and reasoning behavior, especially GSM8K-style elicited reasoning.
The MoE-specific training details are worth noticing because they are easy to ignore until the model collapses politely. The recipe uses auxiliary-loss-free balancing, router z-loss regularization, sigmoid gating with top-$k$ normalization, FP32 router computation, grouped MLP kernels during training, drop-and-pad dispatch during pre-training, and dropless dispatch during SFT. The change from drop-and-pad to dropless dispatch during SFT is a good example of engineering judgment: dropping tokens from structured instruction-response examples would distort the learning signal. A sparse model that learns from damaged instructions is not efficient. It is merely confused at lower FLOPs.
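Three of the routing ingredients named above are easy to show for a single token. The sketch below is not the paper's code. It assumes the forms these techniques usually take in the MoE literature: sigmoid scores renormalized over the selected experts, a per-expert bias that influences selection but not the gate weights (the usual auxiliary-loss-free balancing trick), and a z-loss equal to the squared log-sum-exp of the router logits. The function names and the example logits are hypothetical.

```python
import math


def route_token(logits, top_k, bias=None):
    """Sigmoid gating with top-k normalization for one token.

    bias: optional per-expert balancing offsets, used for selection only
    (assumed form of auxiliary-loss-free balancing).
    Returns {expert_id: gate_weight} with weights summing to 1.
    """
    bias = bias or [0.0] * len(logits)
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    ranked = sorted(range(len(logits)),
                    key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(scores[e] for e in chosen)
    # Gate weights come from the unbiased scores, renormalized over the top-k.
    return {e: scores[e] / norm for e in chosen}


def router_z_loss(logits):
    """Squared log-sum-exp of the logits; penalizes large router logits."""
    return math.log(sum(math.exp(z) for z in logits)) ** 2


logits = [2.0, -1.0, 0.5, 0.0]
print(route_token(logits, top_k=2))
print(route_token(logits, top_k=2, bias=[0.0, 0.0, 0.0, 5.0]))
print(router_z_loss(logits))
```

In a full training loop the bias would be nudged up for under-loaded experts and down for over-loaded ones between steps; that update rule is omitted here because the article does not describe it.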
Then comes INT4 QAT. This stage is essential because the business use case is not BF16 research inference; it is deployment under mobile memory constraints. MobileMoE quantizes linear weights to symmetric group-wise INT4 with group size 32, dynamically quantizes activations to INT8, and keeps router weights in FP32. Keeping the router precise is a small memory cost and a sensible stability tradeoff. In MoE, the router is not clerical paperwork. It decides which subnetworks think.
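Symmetric group-wise INT4 quantization with group size 32 can be sketched in plain Python. This is an illustration of the general technique, not the paper's implementation: the choice of scale (`max_abs / 7`), the rounding mode, and the function names are assumptions. During QAT, a quantize-then-dequantize round trip of this kind is typically applied in the forward pass so the model learns weights that survive the rounding.

```python
def quantize_int4_groupwise(weights, group_size=32):
    """Symmetric group-wise INT4: one scale per group, zero-point fixed at 0.

    Returns (codes, scales) with integer codes in [-8, 7].
    """
    assert len(weights) % group_size == 0, "pad weights to a multiple of group_size"
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0  # avoid division by zero
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize(codes, scales, group_size=32):
    """Map integer codes back to floats using each group's scale."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]


weights = [(i - 32) / 32 for i in range(64)]   # toy weights in [-1, 1)
codes, scales = quantize_int4_groupwise(weights)
restored = dequantize(codes, scales)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(len(scales), worst <= max(scales) / 2 + 1e-9)
```

Smaller groups mean more scales to store but tighter error bounds per group, which is the tradeoff a group size of 32 is balancing.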
The benchmark story is a Pareto frontier, not a victory parade
The paper reports strong results, but the honest interpretation is not “MobileMoE wins everything.” The better interpretation is that MobileMoE improves the tradeoff curve for small on-device models.
On 14 foundational benchmarks, MobileMoE-Base reaches 46.5, 55.4, and 59.8 average accuracy for S/M/L after mid-training. MobileMoE-L, with 922M active parameters, beats OLMoE-1B-7B Base’s 52.4 average despite OLMoE having 1.3B active and 6.9B total parameters. MobileMoE-M also exceeds OLMoE-1B-7B Base on this foundational average, despite using far fewer active and total parameters.
After instruction fine-tuning, MobileMoE-S/M/L score 46.7, 55.3, and 60.1 on the same 14 foundational benchmarks. On advanced benchmarks, MobileMoE-L reaches 44.4 overall, with particularly strong code and math performance: 58.8 average on code and 41.2 on math. The paper also notes that Qwen3.5 2B remains stronger on instruction following and some harder knowledge-and-reasoning settings, likely helped by a more advanced post-training recipe, including distillation and thinking-oriented behavior.
That boundary matters. MobileMoE’s result is not that sparse small models automatically beat every dense model. The result is that, under constrained active parameters and mobile deployment assumptions, the MobileMoE design produces a better quality-compute-memory tradeoff than the tested dense and MoE baselines.
The QAT results are even more deployment-relevant. After INT4 QAT, MobileMoE-S/M/L have static weight footprints of 0.68 GB, 1.48 GB, and 2.75 GB, with average foundational accuracy of 44.0, 52.5, and 57.8. MobileLLM-Pro’s QAT baseline has 0.55 GB weight memory and 45.5 average accuracy. So MobileMoE-S is slightly lower in accuracy and slightly higher in static weight memory than MobileLLM-Pro; its advantage shows up at runtime, where its much smaller active compute translates into lower latency. MobileMoE-M and L buy substantially higher accuracy with larger memory footprints.
A compact way to read the evidence is this:
| Paper result | Role in the argument | Business meaning | Boundary |
|---|---|---|---|
| Scaling-law ablations select $E=8$, $g=8$, shared expert | Main mechanism for MobileMoE architecture selection | On-device models should be designed around memory-qualified sparse compute, not copied from cloud MoE recipes | The fitted regime is sub-billion active parameters and the tested architecture family |
| MobileMoE-Base and SFT form a stronger Pareto curve | Main benchmark evidence | Smaller active compute can preserve or improve quality if total sparse capacity is used well | Benchmark transfer to a specific app task is not guaranteed |
| INT4 QAT preserves most capability | Deployment evidence | 4-bit mobile deployment is feasible without destroying the router/expert system | Uses a specific QAT recipe; PTQ may behave worse |
| Smartphone profiling shows latency gains | Main systems evidence | Sparse compute can reduce real user latency only when supported by a fused runtime operator | Tested on Samsung Galaxy S25 and iPhone 16 Pro, with specific CPU/GPU backends |
| Real prompts reveal higher MoE RSS than dummy prompts | Profiling robustness warning | Benchmarking MoE memory with repeated dummy tokens can understate production memory | Real app prompts may activate different expert patterns again |
The last row deserves applause, faint but genuine. Many papers benchmark with artificial inputs because artificial inputs are clean. MobileMoE explicitly shows why that can be misleading for MoE memory: real prompts activate more diverse experts, which raises peak RSS compared with dummy repeated tokens. In other words, dummy prompts make sparse models look tidier than customers will.
The phone runtime result is where the paper earns its title
The most commercially interesting part of the paper is not that MobileMoE has fewer active FLOPs. It is that the authors deploy it on commodity smartphones and profile it with a custom inference path.
Existing mobile CPU inference stacks already have optimized dense INT4 matrix multiplication. They do not naturally know how to run a routed MoE feed-forward layer efficiently. If each expert dispatch becomes many small operations, MoE’s theoretical advantage can vanish inside kernel overhead, memory movement, and poor batching. The spreadsheet says sparse; the phone says no.
MobileMoE addresses this by implementing a custom fused MoE operator in ExecuTorch. The operator reorders tokens by assigned expert IDs, processes each expert’s token slice as a dense batched matrix multiplication through torchao’s INT4 GEMM path, and fuses top-$k$ selection, dispatch, projections, SwiGLU activation, down projection, and scatter/unpermute into a single operator. Attention and embeddings continue to use the dense XNNPACK INT4 path.
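The core data movement in such an operator (sort token-expert pairs by expert ID, run each expert once on its contiguous slice, then scatter results back to token order) can be shown without any framework. The sketch below is a pure-Python illustration, not the ExecuTorch operator: it does not use torchao's INT4 GEMM, it takes routing decisions as an input instead of fusing top-$k$ selection, and the experts are hypothetical callables standing in for the SwiGLU projections.

```python
def moe_dispatch(tokens, assignments, experts):
    """Grouped MoE dispatch for a small batch.

    tokens: list of feature vectors (lists of floats).
    assignments[t]: list of (expert_id, gate_weight) pairs for token t.
    experts: list of callables mapping a batch of vectors to a batch of vectors.
    """
    # 1. Permute: flatten (expert, token, weight) pairs and sort by expert ID
    #    so every expert sees one contiguous slice.
    pairs = [(e, t, w) for t, routes in enumerate(assignments) for e, w in routes]
    pairs.sort(key=lambda p: p[0])

    out = [[0.0] * len(tokens[0]) for _ in tokens]
    i = 0
    while i < len(pairs):
        e = pairs[i][0]
        j = i
        while j < len(pairs) and pairs[j][0] == e:
            j += 1
        # 2. One batched call per active expert instead of one call per token.
        batch = [tokens[t] for _, t, _ in pairs[i:j]]
        results = experts[e](batch)
        # 3. Scatter/unpermute: accumulate gate-weighted outputs in token order.
        for (_, t, w), y in zip(pairs[i:j], results):
            for d, v in enumerate(y):
                out[t][d] += w * v
        i = j
    return out


# Hypothetical experts that just scale their input by 1, 10, or 100.
experts = [lambda batch, s=s: [[s * x for x in v] for v in batch]
           for s in (1.0, 10.0, 100.0)]
tokens = [[1.0, 2.0], [3.0, 4.0]]
assignments = [[(0, 0.5), (2, 0.5)], [(1, 1.0)]]
print(moe_dispatch(tokens, assignments, experts))
```

Experts that receive no tokens are never called, which is where the sparse savings come from; the fused operator's job is to keep that property without paying per-expert launch overhead.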
This is the bridge between research architecture and product architecture. Without it, MobileMoE would be another model that looks efficient in theory and trips over its shoelaces during deployment.
The latency numbers are therefore meaningful. At comparable INT4 weight memory and similar QAT accuracy, MobileMoE-S is faster than MobileLLM-Pro across Samsung Galaxy S25 CPU, iPhone 16 Pro CPU, and iPhone 16 Pro GPU profiling. Across the reported CPU and GPU settings, MobileMoE-S delivers roughly 1.8–3.8x faster prefill and 2.2–3.4x faster decode. The exact ratio varies by device, backend, and context length, which is precisely what one should expect from hardware-sensitive inference.
The mechanism differs by phase. Prefill is compute-bound, so fewer active parameters reduce feed-forward computation. Decode is more memory-bandwidth-bound, so reading fewer active expert weights per token improves throughput. This distinction matters for product planning. A chat interface with short prompts and long generation does not stress the system the same way as long-document summarization with an 8K input. “On-device latency” is not one number. It is a workload profile wearing a convenient fake mustache.
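A back-of-envelope roofline estimate shows why the two phases respond to different resources. Everything numeric in this sketch is an assumption for illustration: the peak throughput and memory bandwidth describe a hypothetical device, not the profiled phones, and the model ignores attention, KV-cache traffic, and kernel overhead. Only the parameter counts (about 272M active, 1.3B total) and the 4-bit weights echo the article.

```python
def phase_estimate(active_params, weight_bytes_touched, n_tokens,
                   peak_flops, mem_bandwidth):
    """Return (seconds, limiting resource) for one forward pass.

    compute time ~ 2 FLOPs per active parameter per token;
    memory time  ~ weight bytes that must be streamed once.
    """
    compute_s = 2 * active_params * n_tokens / peak_flops
    memory_s = weight_bytes_touched / mem_bandwidth
    bound = "compute" if compute_s >= memory_s else "memory"
    return max(compute_s, memory_s), bound


PEAK_FLOPS = 1e12      # hypothetical device throughput, FLOP/s
BANDWIDTH = 50e9       # hypothetical memory bandwidth, bytes/s
ACTIVE, TOTAL = 272e6, 1.3e9
BYTES_PER_WEIGHT = 0.5  # 4-bit weights

# Decode: one token, only the active experts' weights are read.
decode = phase_estimate(ACTIVE, ACTIVE * BYTES_PER_WEIGHT, 1,
                        PEAK_FLOPS, BANDWIDTH)
# Prefill: many tokens; assume the batch collectively touches every expert.
prefill = phase_estimate(ACTIVE, TOTAL * BYTES_PER_WEIGHT, 2048,
                         PEAK_FLOPS, BANDWIDTH)
print("decode:", decode)
print("prefill:", prefill)
```

With these made-up hardware numbers decode lands on the memory side and prefill on the compute side, matching the qualitative split described above; real devices will move the crossover point.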
Peak runtime memory adds another layer. On Samsung Galaxy S25, MobileMoE-S has 1.49 GB peak RSS at 8K context under real prompts, compared with 1.91 GB for MobileLLM-Pro. MobileMoE-M and L use more RAM but also deliver higher accuracy, with MobileMoE-L staying under 5 GB peak RSS at 8K context in the reported profiling. The paper treats peak RSS as a conservative upper bound because it includes resident weights, KV cache, transient activations, and runtime overhead.
The practical conclusion is not that every app should ship MobileMoE-L tomorrow. The conclusion is that the mobile deployment envelope is wider than dense-model thinking suggested. Sparse routing can be made operational if the runtime is designed for it.
The business pathway: local intelligence becomes an architecture decision
For businesses, MobileMoE points to a shift in how on-device AI should be evaluated. The old question was: “What is the best small dense model we can fit?” The better question is: “What combination of total capacity, active compute, quantization, router stability, KV cache size, and runtime kernel support gives the best user experience under our workload?”
That shift matters in at least four product categories.
First, consumer assistants. A phone assistant that can handle more requests locally reduces cloud inference cost and improves perceived latency. Privacy also improves when fewer prompts leave the device, although privacy is not automatic merely because a model is local. Logs, telemetry, fallback routing, and app permissions still exist. Reality remains annoying.
Second, enterprise mobile apps. Field-service, sales, compliance, and medical-adjacent workflows often involve sensitive text, intermittent connectivity, and fast interaction needs. On-device MoE could allow more local drafting, extraction, classification, and retrieval-augmented interaction, especially when paired with small local indexes.
Third, wearables and embodied agents. These devices face tighter power and memory constraints than flagship phones, so the MobileMoE result is not directly portable. But the design logic is relevant: use sparse active compute, keep memory predictable, and optimize runtime around actual routing behavior rather than pretending dense kernels solve everything.
Fourth, AI infrastructure strategy. For platform companies, the paper suggests that competitive advantage may come less from releasing one tiny model and more from owning the full deployment stack: architecture search, data recipe, quantization, router stability, fused kernels, profiling harnesses, and device-specific execution paths. This is less glamorous than saying “agentic AI,” which is partly why it may matter more.
What remains uncertain before this becomes a product recipe
The paper is strong because it goes beyond benchmark tables into phone deployment. Still, several boundaries should stay visible.
The hardware scope is specific: Samsung Galaxy S25, iPhone 16 Pro, CPU backends through ExecuTorch and XNNPACK, and GPU profiling through MLX on iPhone. Results may change with older phones, cheaper Android devices, NPUs, thermal throttling, background app behavior, and vendor-specific memory policies.
The workload scope is broader than dummy prompts but still controlled. The paper uses real prompts across code, knowledge, and math, which is a better test than repeated tokens. But production apps will have messier distributions: multilingual input, noisy speech transcripts, retrieval context, tool outputs, safety policies, and long sessions with accumulated KV cache.
The model scope is also clear. MobileMoE is a text LLM family. The authors explicitly identify multimodal extensions as future work. For many edge products, especially camera-first assistants, robotics, and AR, the multimodal path may be more important than the text-only result.
Finally, the training recipe is not free. The paper uses serious training infrastructure and multiple stages. A company adopting this direction is not merely swapping a model file. It is committing to a deployment-aware model-development pipeline.
The conclusion: MoE has moved from cloud spectacle to mobile plumbing
MobileMoE is interesting because it makes MoE less theatrical. The expert model is no longer just a giant cloud architecture flexing total parameter count. It becomes a plumbing problem: which weights must be resident, which experts must activate, which tokens go where, which kernels fuse the route, and which memory budget the whole performance has to respect.
That is a healthier way to think about on-device AI. The future of local assistants will not be decided only by who can compress a dense model most aggressively. It will also depend on who can design models where capacity, routing, quantization, and runtime execution cooperate under physical constraints.
In that sense, MobileMoE’s main message is not “phones can run MoE.” The sharper message is: phones can run MoE when MoE is designed for phones. Sparse architecture is the beginning. The kernel is the receipt.
Cognaptus: Automate the Present, Incubate the Future.
[1] Yanbei Chen, Hanxian Huang, Ernie Chang, Jacob Szwejbka, Digant Desai, Zechun Liu, Vikas Chandra, and Raghuraman Krishnamoorthi, “MobileMoE: Scaling On-Device Mixture of Experts,” arXiv:2605.27358v1, May 26, 2026, https://arxiv.org/abs/2605.27358.