Provider: ByteDance
License: Apache 2.0 (fully open and commercially usable)
Access: Open weights on Hugging Face and GitHub
Architecture: Multi-stage TTS pipeline (text → phoneme → acoustic → waveform)


πŸ” Overview

MegaTTS 3 is the third generation of ByteDance's open-source multilingual text-to-speech (TTS) model. It improves on its predecessors in naturalness, cross-lingual fidelity, and emotion-aware voice synthesis.

Key highlights:

  • 🌐 Multilingual Support: Handles speech generation across multiple languages, including code-switching
  • 😃 Emotion and Style Control: Captures prosody, speaker traits, and emotional tone from reference audio or prompts
  • 🗣️ Cross-lingual Voice Cloning: Maintains speaker identity even when switching languages

βš™οΈ Technical Specs

  • Architecture: Modular TTS pipeline (MegaTTS encoder + FastSpeech2 + HiFi-GAN)
  • Input: Text or phonemes with optional prosody references
  • Output: 24 kHz waveform audio
  • Training Data: Mixture of multilingual and expressive speech corpora
  • Customization: Speaker embedding, prosody token, multilingual code-switching interface
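
The staged design listed above (text → phoneme → acoustic → waveform, ending in 24 kHz audio) can be sketched as a chain of small functions. Every stage below is a toy placeholder: the "phonemizer" just keeps letters, the "acoustic model" assigns an arbitrary pitch per symbol, and the "vocoder" renders sine tones. None of the names come from the MegaTTS 3 codebase. The sketch only shows how the stages hand data to each other and how a 24 kHz mono WAV is written with Python's standard library.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24_000  # 24 kHz output, as listed in the specs above


def text_to_phonemes(text: str) -> list[str]:
    # Placeholder front end: treats each letter as one "phoneme".
    return [ch for ch in text.lower() if ch.isalpha()]


def phonemes_to_acoustic(phonemes: list[str], frame_ms: int = 50) -> list[tuple[float, float]]:
    # Placeholder acoustic model: one (pitch_hz, duration_s) pair per phoneme.
    return [(200.0 + 10.0 * (ord(p) % 20), frame_ms / 1000.0) for p in phonemes]


def acoustic_to_waveform(frames: list[tuple[float, float]]) -> list[float]:
    # Placeholder vocoder: renders each frame as a sine tone in [-0.3, 0.3].
    samples: list[float] = []
    for pitch_hz, duration_s in frames:
        n = round(duration_s * SAMPLE_RATE)
        samples.extend(
            0.3 * math.sin(2 * math.pi * pitch_hz * i / SAMPLE_RATE) for i in range(n)
        )
    return samples


def to_wav_bytes(samples: list[float]) -> bytes:
    # Encode float samples as 16-bit little-endian mono PCM in a WAV container.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        pcm = struct.pack(f"<{len(samples)}h", *(int(s * 32767) for s in samples))
        wav.writeframes(pcm)
    return buf.getvalue()


def synthesize(text: str) -> bytes:
    # Chain the stages in the order the pipeline description gives.
    phonemes = text_to_phonemes(text)
    frames = phonemes_to_acoustic(phonemes)
    return to_wav_bytes(acoustic_to_waveform(frames))


if __name__ == "__main__":
    with open("demo.wav", "wb") as f:
        f.write(synthesize("hi"))
```

With the default 50 ms per symbol, `synthesize("hi")` yields 2 × 1,200 = 2,400 samples, or 0.1 s of audio at 24 kHz. Keeping each stage behind a plain function boundary is what makes a modular pipeline easy to test: any stage can be swapped for a real model without touching the others.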

🚀 Deployment


🔗 Resources