VoiceCraft is a neural codec language model for speech editing and zero-shot text-to-speech (TTS). It achieves state-of-the-art performance on diverse, real-world audio such as audiobooks, internet videos, and podcasts. Given only a few seconds of reference audio, it can clone or edit a voice it has not seen during training.
Key Features:
- High Flexibility: Runs inference in Google Colab, in Docker, or from standalone scripts.
- Enhanced Models: Includes TTS-enhanced models in 330M and 830M sizes.
- Ease of Use: Offers Gradio interfaces on HuggingFace Spaces and detailed Colab notebooks for quick testing.
- Training Support: Provides guidance for training and fine-tuning on custom datasets.
Applications:
- Speech Editing: Modify existing speech recordings with precision.
- Zero-shot TTS: Generate natural-sounding speech from text without prior training on the target voice.
- Long TTS Mode: Efficiently handle long texts for TTS applications.
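For long inputs, one common approach is to split the text into sentence-sized chunks and synthesize each chunk separately. The sketch below illustrates that idea with the standard library only. It is an assumption about how long-text handling can be done, not VoiceCraft's actual implementation; `split_sentences`, `chunk_text`, and `max_chars` are hypothetical names introduced for illustration.

```python
import re


def split_sentences(text):
    """Split text into sentences on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_text(text, max_chars=200):
    """Group consecutive sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    long_text = (
        "First sentence here. Second one follows! "
        "Is this the third? Yes, and a fourth."
    )
    # Each chunk would be passed to the TTS model in turn (hypothetical step).
    for index, chunk in enumerate(chunk_text(long_text, max_chars=45)):
        print(index, chunk)
```

Each resulting chunk could then be synthesized on its own and the audio segments concatenated. The regex splitter is deliberately simple and will mis-split abbreviations such as "e.g."; a dedicated sentence tokenizer would be more robust.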
Technical Highlights:
- Uses EnCodec to encode audio and phonemization to process text.
- Supports custom datasets with detailed steps for data preparation and model training.
- Compatible with CUDA-enabled GPUs for accelerated performance.
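As a minimal sketch of the GPU point above, a script can pick a CUDA device when one is available and fall back to the CPU otherwise. This assumes PyTorch as the runtime; `pick_device` and `prefer_gpu` are hypothetical names for illustration and are not part of VoiceCraft.

```python
def pick_device(prefer_gpu=True):
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch  # treated as optional in this sketch
    except ImportError:
        # Without PyTorch there is no way to query CUDA, so use the CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Selected device:", pick_device())
```

The returned string can be passed wherever a PyTorch device is expected, for example `model.to(pick_device())`.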
Licensing:
- Codebase: CC BY-NC-SA 4.0
- Model Weights: Coqui Public Model License 1.0.0
VoiceCraft is aimed at developers and researchers working on speech synthesis and editing, pairing state-of-the-art performance with accessible interfaces.