Provider: Meta AI
License: Llama 4 Community License (open weights; commercial use permitted subject to the license's terms)
Access: Open weights via Hugging Face
Architecture: Sparse Mixture-of-Experts (MoE) with alternating dense and MoE layers
Experts: 128 routed experts plus a shared expert per MoE layer; each token activates the shared expert and one routed expert
Parameters: ~400B total, 17B active per token
🔍 Overview
LLaMA 4 Maverick 17B 128E is a sparse Mixture-of-Experts model from Meta's LLaMA 4 series. It pushes sparse expert design by scaling the routed-expert count to 128 while keeping per-token compute low: only about 17B of its roughly 400B parameters are active for any given token.
Key features:
- 🧪 High-Sparsity MoE: each token activates the shared expert plus just one of 128 routed experts (see the routing sketch after this list)
- 🧠 Scalable Design: total capacity grows by adding experts while per-token activation stays nearly constant
- 🔍 Efficient Inference: per-token compute comparable to a ~17B dense model despite the far larger total parameter count
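
To make the routing concrete, here is a minimal, illustrative PyTorch sketch of a shared-expert-plus-one-routed-expert MoE layer. The class name `ToyMoELayer`, the dimensions, and the gate-weighted combination are assumptions chosen for readability; this is not Meta's implementation.

```python
# Toy shared-expert + top-1 routed MoE layer. Everything here (names,
# sizes, gate weighting) is illustrative, not Meta's Llama 4 code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def ffn(d_model, d_ff):
    # Small feed-forward block used for both the shared and routed experts.
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_routed=128):
        super().__init__()
        self.shared = ffn(d_model, d_ff)                   # shared expert: sees every token
        self.experts = nn.ModuleList([ffn(d_model, d_ff) for _ in range(n_routed)])
        self.router = nn.Linear(d_model, n_routed)         # per-token logits over routed experts

    def forward(self, x):                                  # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)          # routing distribution
        gate, choice = probs.max(dim=-1)                   # pick one routed expert per token
        routed = torch.zeros_like(x)
        for e in choice.unique().tolist():                 # run each chosen expert on its tokens
            mask = choice == e
            routed[mask] = gate[mask].unsqueeze(1) * self.experts[e](x[mask])
        return self.shared(x) + routed                     # combine shared and routed paths

layer = ToyMoELayer()
tokens = torch.randn(16, 64)                               # 16 token embeddings
print(layer(tokens).shape)                                 # torch.Size([16, 64])
```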
⚙️ Technical Details
- Model Type: Decoder-only transformer with alternating dense and MoE layers
- Experts: 128 routed experts plus a shared expert per MoE layer; the router sends each token to the shared expert and one routed expert
- Active Parameters per Token: ~17B of ~400B total (see the back-of-envelope sketch after this list)
- Tokenizer: LLaMA tokenizer family
- Training: Focused on routing diversity, sparsity effects, and compute efficiency
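
The gap between total and active parameters is simple bookkeeping over which weights a single token actually touches. The sketch below runs that arithmetic; the dense and per-expert sizes are hypothetical placeholders picked only so the result lands near the headline figures, not a breakdown published by Meta.

```python
# Back-of-envelope MoE parameter accounting. The dense/per-expert sizes
# below are hypothetical placeholders, not Maverick's published breakdown.
def moe_param_split(dense_params, params_per_expert, n_routed, n_active_routed, n_shared=1):
    """Return (total, active-per-token) parameter counts for a sparse MoE model."""
    total = dense_params + params_per_expert * (n_routed + n_shared)
    active = dense_params + params_per_expert * (n_active_routed + n_shared)
    return total, active

total, active = moe_param_split(
    dense_params=11e9,       # attention, embeddings, dense FFN layers (assumed)
    params_per_expert=3e9,   # one routed expert summed over all MoE layers (assumed)
    n_routed=128,            # routed experts, from the spec above
    n_active_routed=1,       # one routed expert activated per token
)
print(f"total ~ {total / 1e9:.0f}B params, active per token ~ {active / 1e9:.0f}B")
# total ~ 398B params, active per token ~ 17B
```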
🚀 Deployment
- Model Card: LLaMA 4 Maverick on Hugging Face
- Tools: MoE-aware serving stacks such as Hugging Face Transformers or vLLM (see the loading sketch below)
- Use Cases: Sparse model benchmarking, MoE routing strategy experiments, LLM scaling research
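
A minimal loading-and-generation sketch through Hugging Face Transformers follows. The repository id is assumed from the model card's naming, the weights are gated, and whether the plain text-generation pipeline resolves this checkpoint directly should be verified against the model card; the full-precision model needs a multi-GPU node.

```python
# Minimal generation sketch via Hugging Face Transformers. The repo id is
# an assumption based on the model card's naming; weights are gated and the
# full-precision checkpoint requires a multi-GPU node.
import torch
from transformers import pipeline

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"  # assumed repo id

generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,  # bf16 weights; quantized variants cut memory further
    device_map="auto",           # shard layers/experts across available GPUs
)

result = generator(
    "Explain mixture-of-experts routing in two sentences.",
    max_new_tokens=128,
)
print(result[0]["generated_text"])
```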