VoiceCraft

State-of-the-art speech editing and zero-shot text-to-speech in the wild

2024-04-03

VoiceCraft is a neural codec language model for speech editing and zero-shot text-to-speech (TTS). It reports state-of-the-art performance on diverse, real-world audio such as audiobooks, internet videos, and podcasts. Given only a few seconds of reference audio, it can clone or edit a voice it has not seen during training.

Key Features:

  • High Flexibility: Runs inference in several environments, including Google Colab, Docker, and standalone scripts.
  • Enhanced Models: Includes TTS-enhanced models in 330M and 830M parameter sizes.
  • Ease of Use: Offers Gradio interfaces on HuggingFace Spaces and detailed Colab notebooks for quick testing.
  • Training Support: Provides guidance for training and fine-tuning on custom datasets.

Applications:

  • Speech Editing: Modify existing speech recordings with precision.
  • Zero-shot TTS: Generate natural-sounding speech from text without prior training on the target voice.
  • Long TTS Mode: Efficiently handle long texts for TTS applications.
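
To make the speech editing use case concrete: before any audio is regenerated, an editing workflow has to work out which words changed between the original transcript and the edited one, and which stretch of the recording those words cover. The following is a minimal sketch of that step using only the Python standard library. It is not VoiceCraft's own code; the function names find_edit_span and span_to_seconds are hypothetical, and the per-word timestamps are assumed to come from a separate forced-alignment step.

```python
from difflib import SequenceMatcher


def find_edit_span(original: str, edited: str):
    """Return the word-index spans that differ between two transcripts.

    Result is ((orig_start, orig_end), (new_start, new_end)), end-exclusive,
    or None when the transcripts are identical. All changed regions are
    merged into one contiguous span.
    """
    orig_words = original.split()
    new_words = edited.split()
    matcher = SequenceMatcher(a=orig_words, b=new_words, autojunk=False)
    changed = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if not changed:
        return None
    orig_span = (changed[0][1], changed[-1][2])
    new_span = (changed[0][3], changed[-1][4])
    return orig_span, new_span


def span_to_seconds(word_times, span):
    """Map a span of original words to (start_sec, end_sec).

    word_times holds one (start, end) pair per original word. These values
    are assumed to come from a forced aligner (hypothetical input).
    """
    start, end = span
    if start == end:
        # Pure insertion: a zero-length region at the word boundary.
        t = word_times[start][0] if start < len(word_times) else word_times[-1][1]
        return (t, t)
    return (word_times[start][0], word_times[end - 1][1])


if __name__ == "__main__":
    spans = find_edit_span("the quick brown fox jumps", "the quick red fox jumps")
    times = [(0.0, 0.2), (0.2, 0.5), (0.5, 0.9), (0.9, 1.2), (1.2, 1.6)]
    print(spans, span_to_seconds(times, spans[0]))
```

Only the audio inside the returned time range would need to be regenerated; the rest of the recording can stay untouched.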

Technical Highlights:

  • Uses Encodec to encode audio into discrete tokens and phonemizes the input text.
  • Supports custom datasets with detailed steps for data preparation and model training.
  • Compatible with CUDA-enabled GPUs for accelerated performance.
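
As a rough illustration of what codec-based audio encoding means for sequence length, the sketch below estimates how many discrete tokens a short reference clip turns into. The frame rate (50 frames per second) and the number of codebooks (4) are assumptions chosen for illustration, not values stated in this summary, and codec_token_count is a hypothetical helper name.

```python
def codec_token_count(duration_sec: float,
                      frame_rate_hz: int = 50,   # assumed codec frame rate
                      num_codebooks: int = 4):   # assumed number of codebooks
    """Estimate (frames, total_tokens) for a clip of the given duration.

    Each codec frame carries one token per codebook, so the total token
    count is frames * num_codebooks.
    """
    frames = round(duration_sec * frame_rate_hz)
    return frames, frames * num_codebooks


if __name__ == "__main__":
    frames, tokens = codec_token_count(3.0)
    print(f"3 s of audio: {frames} frames, {tokens} tokens")
```

Under these assumed settings, a few seconds of reference audio amounts to only a few hundred frames, which is why a short prompt is enough context for the language model to work with.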

Licensing:

  • Codebase: CC BY-NC-SA 4.0
  • Model Weights: Coqui Public Model License 1.0.0

VoiceCraft is aimed at developers and researchers working on speech synthesis and editing, pairing strong reported performance with ready-made interfaces for quick testing.

Tags: Artificial Intelligence, Voice Cloning, Text-to-Speech, Speech Editing, Neural Codec