Provider: Meta AI
License: Llama 4 Community License (custom Meta license with use restrictions)
Access: Open weights via Hugging Face
Architecture: Sparse Mixture-of-Experts (MoE), Top-2 Routing
Experts: 128 total experts, 2 active per token
Parameters: ~400B total, 17B active per token


🔍 Overview

LLaMA 4 Maverick 17B 128E is an ultra-sparse, experimental MoE model from Meta’s LLaMA 4 research series. It pushes sparse expert design further by scaling to 128 experts per MoE layer, aiming to preserve quality while activating only a small fraction of the model’s parameters, and therefore only a fraction of the compute, for each token.

Key features:

  • 🧪 High-Sparsity MoE: Only 2 of 128 experts are active per token (see the routing sketch after this list)
  • 🧠 Scalable Design: Explores large-scale routing and activation for efficient scaling
  • 🔍 Research Preview: Released to investigate inference dynamics of ultra-sparse models
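
To make the routing mechanism concrete, here is a minimal PyTorch sketch of a top-2-of-128 MoE layer. It illustrates the general technique only and is not Meta's implementation; the hidden size, expert MLP shape, and softmax-renormalised gating are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top2MoELayer(nn.Module):
    """Minimal sparse MoE layer: each token is routed to 2 of 128 experts.

    Illustrative sketch only; expert shapes and gating details are
    assumptions, not Meta's implementation.
    """

    def __init__(self, hidden_size: int = 256, num_experts: int = 128, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.SiLU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size)
        router_logits = self.router(x)                                # (tokens, num_experts)
        top_logits, top_experts = torch.topk(router_logits, self.top_k, dim=-1)
        gate_weights = F.softmax(top_logits, dim=-1)                  # renormalise over the chosen 2
        output = torch.zeros_like(x)
        # Only the selected experts run for each token -- this is the sparse compute path.
        for slot in range(self.top_k):
            for expert_id in top_experts[:, slot].unique().tolist():
                token_mask = top_experts[:, slot] == expert_id
                expert_out = self.experts[expert_id](x[token_mask])
                output[token_mask] += gate_weights[token_mask, slot].unsqueeze(-1) * expert_out
        return output


if __name__ == "__main__":
    layer = Top2MoELayer()
    tokens = torch.randn(16, 256)   # a batch of 16 token embeddings
    print(layer(tokens).shape)      # torch.Size([16, 256])
```

Only the two selected expert MLPs run for each token, which is where the per-token compute savings described above come from.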

⚙️ Technical Details

  • Model Type: Decoder-only transformer with MoE layers
  • Experts: 128 total, Top-2 routing per token
  • Active Parameters per Token: ~17B (the two routed experts plus the dense layers shared by all tokens; see the sparsity sketch after this list)
  • Tokenizer: LLaMA tokenizer family
  • Training: Focused on routing diversity, sparsity effects, and compute efficiency
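
To put the sparsity figures in perspective, the short sketch below works through the routing arithmetic. The class and field names are invented for illustration and do not correspond to any official configuration schema; the parameter counts are the approximate figures listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoESparsitySpecs:
    """Illustrative record of the figures listed above.

    The class and field names are invented for this sketch; they are not the
    keys of any official configuration file.
    """
    num_experts: int = 128          # routed experts per MoE layer
    experts_per_token: int = 2      # top-k routing width
    total_params_b: float = 400.0   # approximate total parameters, in billions
    active_params_b: float = 17.0   # approximate parameters used per token, in billions

    @property
    def expert_fraction(self) -> float:
        """Fraction of routed experts that fire for any single token."""
        return self.experts_per_token / self.num_experts

    @property
    def active_param_fraction(self) -> float:
        """Fraction of all weights touched per token (experts plus shared dense layers)."""
        return self.active_params_b / self.total_params_b


specs = MoESparsitySpecs()
print(f"{specs.expert_fraction:.1%} of experts active per token")           # 1.6%
print(f"{specs.active_param_fraction:.1%} of parameters active per token")  # roughly 4%
```

Per-token FLOPs scale with the active parameters, although the full set of weights still has to be held in memory (or offloaded) at inference time.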

🚀 Deployment

  • Model Card: LLaMA 4 Maverick on Hugging Face
  • Tools: Requires an inference stack with MoE support (e.g., PyTorch with DeepSpeed, or another MoE-aware serving framework); see the loading sketch after this list
  • Use Cases: Sparse model benchmarking, MoE routing strategy experiments, LLM scaling research
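
A minimal loading sketch is shown below, assuming the checkpoint is hosted on the Hugging Face Hub and that the installed version of Transformers supports the Llama 4 MoE architecture. The repository id is an assumption based on the model's name, and a model of this size will need multiple GPUs or aggressive offloading.

```python
# Hypothetical loading sketch -- the repo id and Transformers support are assumptions,
# not details confirmed by this page. Meta checkpoints on the Hub are gated, so an
# authenticated `huggingface-cli login` is typically required first.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-4-Maverick-17B-128E"  # assumed Hub repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # bf16 halves memory relative to fp32
    device_map="auto",           # shard across available GPUs / offload to CPU
)

prompt = "Sparse mixture-of-experts models are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

For routing-strategy experiments rather than serving, a custom PyTorch loop (as in the routing sketch above) or DeepSpeed's MoE utilities may be more practical than full-model generation.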

🔗 Resources