Provider: Meta AI
License: Apache 2.0 (permissive open-source license)
Access: Open weights available on Hugging Face
Architecture: Self-supervised Transformer-based speech encoder
Training Data: 960 hours of LibriSpeech audio
📖 Overview
Wav2Vec2 Large 960h is one of the most influential speech foundation models released by Meta AI. It learns high-quality audio representations through self-supervised learning, so it can be pretrained on raw, unlabeled audio without manual transcription.
This particular checkpoint is additionally fine-tuned for automatic speech recognition (ASR) with a CTC head on the full 960 hours of LibriSpeech, and the wav2vec 2.0 family forms the backbone of many modern open-source speech systems.
Key strengths:
- 🧠 Self-supervised training on raw waveform audio
- 🧪 Strong ASR performance with limited labeled data
- ⚡ Reusable audio embeddings for downstream tasks
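As a concrete illustration of the ASR use, here is a minimal inference sketch using 🤗 Transformers. It assumes `transformers`, `torch`, and `numpy` are installed; the checkpoint is downloaded on first run, and the silent dummy waveform is only a placeholder for real 16 kHz audio.

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model_id = "facebook/wav2vec2-large-960h"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)
model.eval()

# Placeholder: 1 second of silence at 16 kHz; substitute a real waveform
waveform = np.zeros(16000, dtype=np.float32)

inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits  # (batch, frames, vocab)

# Greedy CTC decoding: argmax per frame, then collapse repeats/blanks
pred_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(pred_ids)[0]
print(transcription)
```

For production use, a beam-search decoder with a language model typically lowers word error rate over this greedy decode.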
⚙️ Technical Specs
- Architecture: Convolutional feature encoder followed by a 24-layer Transformer context network (hidden size 1024)
- Input: Raw waveform audio (16 kHz)
- Training Dataset: LibriSpeech 960h
- Pretraining Method: Masked contrastive learning over quantized latent speech representations (the wav2vec 2.0 objective)
- Output: Speech embeddings, or character-level transcriptions via the CTC head in this fine-tuned checkpoint
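The contrastive pretraining objective can be sketched in plain PyTorch. This is an illustrative toy version, not Meta's implementation: the function name, distractor count, and temperature are invented for the example, and the real model additionally uses a codebook diversity loss and excludes the true target from the distractor pool.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, num_distractors=5, temperature=0.1):
    """Toy wav2vec 2.0-style loss: at each masked time step, pick the true
    quantized latent among distractors by cosine similarity."""
    # context: (T, D) Transformer outputs at masked time steps
    # targets: (T, D) quantized latents for the same time steps
    T, _ = context.shape
    # Sample distractors from other time steps (the real model excludes
    # the true index; this sketch does not bother)
    idx = torch.randint(0, T, (T, num_distractors))
    candidates = torch.cat([targets.unsqueeze(1), targets[idx]], dim=1)  # (T, K+1, D)
    sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1) / temperature
    labels = torch.zeros(T, dtype=torch.long)  # true latent sits at index 0
    return F.cross_entropy(sims, labels)

loss = contrastive_loss(torch.randn(50, 256), torch.randn(50, 256))
```

Because no labels are involved, this objective is what lets the encoder learn from raw, untranscribed audio.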
🚀 Deployment
- Hugging Face Repo: https://huggingface.co/facebook/wav2vec2-large-960h
- Frameworks: 🤗 Transformers, PyTorch, ONNX
- Use Cases: speech recognition, audio feature extraction, speech analytics
- Hardware: GPU recommended for training; CPU feasible for inference
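For the audio feature extraction use case, the encoder can be queried directly for frame-level embeddings instead of transcriptions. A minimal sketch, again with a dummy waveform standing in for real 16 kHz audio:

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model_id = "facebook/wav2vec2-large-960h"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
encoder = Wav2Vec2Model.from_pretrained(model_id)
encoder.eval()

waveform = np.zeros(16000, dtype=np.float32)  # 1 s of dummy 16 kHz audio
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    # One 1024-dim embedding per ~20 ms frame of audio
    embeddings = encoder(inputs.input_values).last_hidden_state

print(embeddings.shape)  # (1, frames, 1024)
```

These embeddings can feed downstream classifiers for tasks such as speaker or emotion recognition without fine-tuning the encoder itself.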