Provider: Meta AI
License: Apache 2.0 (permissive open-source license)
Access: Open weights available on Hugging Face
Architecture: Self-supervised Transformer-based speech encoder
Training Data: 960 hours of LibriSpeech audio


🔍 Overview

Wav2Vec2 Large 960h is one of the most influential speech foundation models released by Meta AI. Its encoder learns high-quality audio representations through self-supervised learning, so it can be pretrained on raw audio without manual transcriptions.

This particular checkpoint is already fine-tuned for automatic speech recognition (ASR) on LibriSpeech, and the underlying pretrained encoder forms the backbone of many modern open-source speech systems.

Key strengths:

  • 🎧 Self-supervised training on raw waveform audio
  • 🧠 Strong ASR performance with limited labeled data
  • ⚡ Reusable audio embeddings for downstream tasks
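The reusable-embedding point can be seen directly from the model's API shape. The sketch below runs a tiny, randomly initialized Wav2Vec2 encoder over one second of audio; the sizes are hypothetical (chosen so the example runs quickly without downloading the real 960h checkpoint), but the output structure, a sequence of frame-level embeddings, is the same:

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2Model

# Tiny, randomly initialized config -- hypothetical sizes for illustration
# only; the real checkpoint is facebook/wav2vec2-large-960h.
config = Wav2Vec2Config(
    hidden_size=32, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=64,
)
model = Wav2Vec2Model(config)
model.eval()

# One second of 16 kHz audio (random noise standing in for speech).
waveform = torch.randn(1, 16000)
with torch.no_grad():
    out = model(waveform)

# The convolutional feature encoder downsamples ~320x, so one second of
# audio becomes roughly 50 embedding frames of size hidden_size.
print(out.last_hidden_state.shape)  # (1, frames, 32)
```

These frame-level embeddings are what downstream tasks (speaker ID, emotion recognition, keyword spotting) reuse instead of raw audio.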

โš™๏ธ Technical Specs

  • Architecture: Transformer encoder
  • Input: Raw waveform audio (16 kHz)
  • Training Dataset: LibriSpeech 960h
  • Pretraining Method: Contrastive learning over masked, quantized latent speech representations
  • Output: Character-level transcriptions via CTC; the Transformer encoder also exposes frame-level speech embeddings
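The contrastive pretraining objective listed above can be sketched as an InfoNCE-style loss: the context network's output at a masked position should be closer to the true quantized latent than to distractor latents sampled from other frames. This is an illustrative NumPy toy, not the actual Fairseq implementation; the dimensions, temperature, and sampling scheme are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(context, target, distractors, temperature=0.1):
    """Contrastive loss for one masked frame: reward high cosine
    similarity between the context vector and the true latent,
    relative to the distractors."""
    candidates = np.vstack([target[None, :], distractors])  # true latent at index 0
    sims = candidates @ context / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(context)
    )
    logits = sims / temperature
    logits -= logits.max()                  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                # cross-entropy on the true latent

dim = 8
context = rng.normal(size=dim)                    # Transformer output at a masked step
target = context + 0.05 * rng.normal(size=dim)    # well-aligned true quantized latent
distractors = rng.normal(size=(10, dim))          # latents from other time steps
loss = info_nce(context, target, distractors)
print(float(loss))  # small, since context and target are nearly aligned
```

Minimizing this loss over many masked frames is what forces the encoder to learn discriminative speech representations without any transcripts.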

🚀 Deployment

  • Hugging Face Repo: https://huggingface.co/facebook/wav2vec2-large-960h
  • Frameworks: 🤗 Transformers, PyTorch, ONNX
  • Use Cases: speech recognition, audio feature extraction, speech analytics
  • Hardware: GPU recommended for training; CPU feasible for inference
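Putting the deployment pieces together, a minimal inference sketch with the standard Transformers pairing of Wav2Vec2Processor and Wav2Vec2ForCTC follows. It downloads the checkpoint from the Hugging Face repo above; the synthetic sine wave stands in for a real 16 kHz mono recording, so the decoded text is not meaningful here:

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")
model.eval()

# One second of synthetic 16 kHz audio; replace with a real recording.
t = np.arange(16000) / 16000
waveform = np.sin(2 * np.pi * 440 * t).astype(np.float32)

# The processor normalizes the waveform and batches it as tensors.
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits  # (batch, frames, vocab)

# Greedy CTC decoding: argmax per frame, then collapse repeats/blanks.
pred_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(pred_ids)[0]
print(transcription)
```

This runs on CPU, consistent with the hardware note above; a GPU mainly matters for fine-tuning or high-throughput batch transcription.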

🔗 Resources