Fish-Speech

Advanced Multilingual Text-to-Speech with Large Language Models

2024-07-07

Fish-Speech is a cutting-edge text-to-speech (TTS) system that leverages large language models to deliver high-quality, multilingual speech synthesis with unique features:

  • Zero-shot & Few-shot TTS: Generate high-quality TTS output from just a 10-30 second voice sample, enabling near-instant voice cloning.
  • Multilingual Support: Seamlessly handles multiple languages including English, Japanese, Korean, Chinese, French, German, Arabic, and Spanish without requiring language-specific configurations.
  • End-to-End Architecture: Unlike traditional three-stage systems (ASR+LLM+TTS), Fish-Speech integrates all components natively for better performance and simplicity.
  • Emotional Speech: The model can generate speech with emotional inflection, making synthesized voices sound more natural and expressive.
  • Timbre Control: Users can adjust speech characteristics using reference audio for personalized voice output.
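To make the zero-shot cloning requirement concrete, the 10-30 second reference-clip window can be validated before any audio is sent to a cloning pipeline. The sketch below uses only the Python standard library; the helper names and the exact bounds check are ours for illustration, not part of the Fish-Speech API:

```python
import io
import wave

def clip_duration_seconds(wav_bytes: bytes) -> float:
    """Duration of a PCM WAV clip in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / w.getframerate()

def is_valid_reference(wav_bytes: bytes, lo: float = 10.0, hi: float = 30.0) -> bool:
    """Check that a reference clip falls in the 10-30 s window described
    above; the bounds and this helper are illustrative assumptions."""
    return lo <= clip_duration_seconds(wav_bytes) <= hi

def silent_wav(seconds: float, rate: int = 16000) -> bytes:
    """Generate a silent mono 16-bit PCM WAV, handy for trying the check."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(rate * seconds))
    return buf.getvalue()

print(is_valid_reference(silent_wav(15)))  # True: 15 s is inside the window
print(is_valid_reference(silent_wav(3)))   # False: 3 s is too short
```

A check like this is cheap insurance: clips outside the expected window tend to degrade cloning quality, so rejecting them up front saves a round trip through the model.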

Technical highlights include:

  • Achieves low character and word error rates (CER/WER of roughly 2% for English)
  • Fast inference: about a 1:5 real-time factor on an RTX 4060, i.e. audio is generated roughly five times faster than real time
  • Multiple deployment options including WebUI (Gradio) and native GUI (PyQt6)
  • Cross-platform support for Linux, Windows and macOS
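To make the real-time factor figure concrete: a factor below 1 means synthesis is faster than playback. A tiny helper, with illustrative numbers rather than benchmark results:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time spent synthesizing divided by the duration
    of the audio produced. Values below 1.0 mean faster than real time."""
    return synthesis_seconds / audio_seconds

# A 1:5 factor means e.g. 10 s of speech synthesized in about 2 s.
print(real_time_factor(2.0, 10.0))  # 0.2
```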

The project is research-backed, with a paper on arXiv, and open source: the code is released under the Apache License and the model weights under CC-BY-NC-SA-4.0. As an early alpha release, it welcomes community contributions to improve inference speed and fix bugs.

Tags: Text-to-Speech · Voice Cloning · Multilingual Support · Artificial Intelligence · Speech Synthesis