MoA Than One Curve: Teaching FFNs to Choose Their Nonlinearity

Model architecture has a recurring habit: when something works, we freeze it into a default and move the argument elsewhere. Attention gets the drama. Routing gets the diagrams. Context windows get the product demos. Meanwhile, the feedforward network sits there, quietly holding a large share of the parameters and applying the same nonlinearity to every token, every time, as if “one curve fits all” were a law of nature rather than a convenient engineering choice. ...

June 7, 2026 · 17 min · Zelina

Beam Me Less, Scotty: MoE Models Learn When Not to Call Every Expert

Latency has a way of turning elegant model architecture into an invoice. Mixture-of-Experts models were supposed to soften that invoice. Instead of sending every token through the same dense feed-forward machinery, an MoE layer sends each token to only a few experts. In theory, this gives us scale without paying for all parameters on every token. In practice, many deployed MoE models still behave like a restaurant that insists every guest order the same number of dishes. The experts differ, but the billable count is fixed. ...

June 4, 2026 · 15 min · Zelina

The Experts Are Sparse Inside: Why MoE Cost Cuts Stop at 1.2x

Cost has a way of making architecture fashionable. Mixture-of-Experts models became attractive because they promise a pleasant bargain: keep a large total parameter count, but activate only a small part of the model for each token. In business language, that sounds like capacity without the full compute bill. In engineering language, it means routing each token to a few expert feed-forward networks instead of running every expert all the time. ...

May 27, 2026 · 16 min · Zelina

No Prompt Left Behind: How Shopee’s CompassMax Reinvents RL for Giant MoE Models

Rollouts are expensive little creatures. They consume GPU time, produce long reasoning traces, wait for reward computation, and then—if the reward signal is flat—contribute exactly nothing to learning. The GPU was busy. The training dashboard looked serious. The model learned no usable distinction. Very productive, in the same way a meeting with twelve people and no decision is productive. ...

December 9, 2025 · 18 min · Zelina

LLaMA 4 Maverick 17B 128E (Original)

Meta’s experimental ultra-sparse MoE model with 128 experts, designed to explore efficient large-scale scaling and routing strategies for future LLaMA architectures.

1 min

LLaMA 4 Scout 17B 16E

Meta’s experimental LLaMA 4-series MoE model with 17 billion active parameters and 16 experts, designed to explore sparse routing and scaling strategies.

1 min

LLaMA 4 Scout 17B Instruct (Unsloth, 4-bit)

A 4-bit quantized, instruction-tuned variant of Meta’s LLaMA 4 Scout MoE model, optimized by Unsloth for efficient fine-tuning and deployment.

1 min

Mixtral 8x7B Instruct v0.1

A powerful sparse Mixture-of-Experts (MoE) instruction-tuned language model by Mistral AI, combining efficiency and performance for chat and task-oriented generation.

1 min