Amphion

The toolkit for Audio, Music, and Speech Generation

2024-11-21

Amphion is a toolkit for audio, music, and speech generation. It aims to support reproducible research and to help junior researchers and engineers get started in the field. A distinctive feature is its visualization of classic models and architectures, which serves as an educational resource for understanding how these models work.

Key Features

  • Support for Multiple Generation Tasks:
    • Text-to-Speech (TTS): Supported models include FastSpeech2, VITS, VALL-E, NaturalSpeech2, and more.
    • Voice Conversion (VC): Features models like Vevo, FACodec, and Noro for zero-shot conversion.
    • Singing Voice Synthesis (SVS) & Conversion (SVC): Includes both supported models and models still in development for high-quality singing voice applications.
    • Text-to-Audio (TTA) & Text-to-Music (TTM): Includes latent diffusion models for generating audio and music from text.
  • Vocoders & Evaluation Metrics: Provides various neural vocoders and comprehensive metrics for evaluating generated audio quality.
  • Large-Scale Datasets: Supports datasets such as Emilia-Large (over 200k hours) and provides preprocessing pipelines for in-the-wild data.
  • Visualization Tools: Includes SingVisio for illustrating diffusion models in singing voice conversion.

Recent Updates

  • Vevo1.5: Unified and controllable generation for speech and singing voice (2025).
  • Metis: A foundation model for unified speech generation (2025).
  • Emilia-Large Dataset: A dataset of over 200k hours that combines Emilia and Emilia-YODAS (2025).

Installation & Usage

Amphion can be installed via Conda or Docker, with detailed recipes provided for tasks like TTS, VC, and SVC. The toolkit is open-source under the MIT License, making it free for both research and commercial use.

Contributions & Citations

Amphion welcomes contributions. It has been cited in multiple research papers, and its own publications include technical reports and papers accepted at venues such as ICLR 2025 and IEEE SLT 2024.

Topics: Audio Generation, Speech Synthesis, Voice Conversion, Text-to-Speech, Music Generation