Fish-Speech

Advanced Multilingual Text-to-Speech with Large Language Models

2024-07-07

Fish-Speech is a cutting-edge text-to-speech (TTS) system that leverages large language models to deliver high-quality, multilingual speech synthesis with unique features:

  • Zero-shot & Few-shot TTS: Generate high-quality TTS output from just a 10-30 second voice sample, enabling near-instant voice cloning.
  • Multilingual Support: Seamlessly handles multiple languages including English, Japanese, Korean, Chinese, French, German, Arabic, and Spanish without requiring language-specific configurations.
  • End-to-End Architecture: Unlike traditional three-stage systems (ASR+LLM+TTS), Fish-Speech integrates all components natively for better performance and simplicity.
  • Emotional Speech: The model can generate speech with emotional inflection, making synthesized voices sound more natural and expressive.
  • Timbre Control: Users can adjust speech characteristics using reference audio for personalized voice output.
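To make the zero-shot cloning requirement concrete, the 10-30 second reference-clip window can be validated before any audio is sent to a cloning pipeline. The sketch below uses only the Python standard library; the helper names and the exact bounds check are ours for illustration, not part of the Fish-Speech API:

```python
import io
import wave

def clip_duration_seconds(wav_bytes: bytes) -> float:
    """Duration of a PCM WAV clip in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / w.getframerate()

def is_valid_reference(wav_bytes: bytes, lo: float = 10.0, hi: float = 30.0) -> bool:
    """Check that a reference clip falls in the 10-30 s window described
    above; the bounds and this helper are illustrative assumptions."""
    return lo <= clip_duration_seconds(wav_bytes) <= hi

def silent_wav(seconds: float, rate: int = 16000) -> bytes:
    """Generate a silent mono 16-bit PCM WAV, handy for trying the check."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(rate * seconds))
    return buf.getvalue()

print(is_valid_reference(silent_wav(15)))  # True: 15 s is inside the window
print(is_valid_reference(silent_wav(3)))   # False: 3 s is too short
```

A check like this is cheap insurance: clips outside the expected window tend to degrade cloning quality, so rejecting them up front saves a round trip through the model.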

Technical highlights include:

  • Achieves low character and word error rates (CER/WER of roughly 2% for English)
  • Fast inference: about a 1:5 real-time factor on an RTX 4060, i.e. audio is generated roughly five times faster than real time
  • Multiple deployment options including WebUI (Gradio) and native GUI (PyQt6)
  • Cross-platform support for Linux, Windows and macOS
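To make the real-time factor figure concrete: a factor below 1 means synthesis is faster than playback. A tiny helper, with illustrative numbers rather than benchmark results:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time spent synthesizing divided by the duration
    of the audio produced. Values below 1.0 mean faster than real time."""
    return synthesis_seconds / audio_seconds

# A 1:5 factor means e.g. 10 s of speech synthesized in about 2 s.
print(real_time_factor(2.0, 10.0))  # 0.2
```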

The project is research-backed, with a paper on arXiv, and open source: the code is released under the Apache License and the model weights under CC-BY-NC-SA-4.0. As an early alpha release, it welcomes community contributions to improve inference speed and fix bugs.

Tags: Text-to-Speech · Voice Cloning · Multilingual Support · Artificial Intelligence · Speech Synthesis