IndexTTS2 is an open-source autoregressive text-to-speech system with precise timing control and emotion–timbre disentanglement.
Specify token-level timing or apply global ratio scaling (0.75×–1.25×). Perfect for dubbing, lip-sync, and audiovisual alignment.
Separate emotion from speaker identity. Clone a voice and apply happy, sad, or angry tones independently.
Clone any voice from a single 5–15 second audio sample. No fine-tuning or training data required.
Real-time streaming for interactive apps. Batch mode for processing large scripts and audiobook chapters.
Load and run with model weights hosted on Hugging Face. Pre-built pipelines for common inference scenarios.
Mix English and Chinese within a single utterance. Natural switching without artifacts or pauses.
Most text-to-speech systems treat speech as a sequence of acoustic features predicted frame by frame. IndexTTS2 takes a different approach: it uses an autoregressive language model to predict discrete speech tokens, then decodes them to waveform through a vocoder. This architecture—similar to how large language models generate text—enables emergent capabilities like emotional expression and natural prosody that frame-level models struggle with.
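The two-stage pipeline described above can be sketched in miniature. The "language model" and "vocoder" below are toy stand-ins written for illustration, not the actual IndexTTS2 components, and the token rate and sample rate are arbitrary:

```python
import math

def toy_lm_next_token(context):
    # Toy "language model": deterministically derives the next discrete
    # speech token from the running context (a stand-in for sampling
    # from an autoregressive transformer).
    return (sum(context) * 31 + len(context)) % 1024

def generate_speech_tokens(prompt_tokens, n_new):
    # Autoregressive loop: each new token is conditioned on all previous ones.
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        tokens.append(toy_lm_next_token(tokens))
    return tokens

def toy_vocoder(tokens, samples_per_token=240):
    # Toy "vocoder": maps each discrete token to a short sine burst
    # (a real neural vocoder predicts waveform samples instead).
    wave = []
    for t in tokens:
        freq = 80.0 + (t % 256)  # pseudo pitch derived from the token id
        for n in range(samples_per_token):
            wave.append(math.sin(2 * math.pi * freq * n / 24000))
    return wave

tokens = generate_speech_tokens([1, 2, 3], n_new=10)
audio = toy_vocoder(tokens)
print(len(tokens), len(audio))  # 13 tokens -> 13 * 240 samples
```

The point of the sketch is the division of labor: the autoregressive stage decides *what* discrete units to emit and in what rhythm, while the vocoder stage only renders those units as audio, which is what makes token-level timing control possible.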
The key innovation is the separation of duration, emotion, and speaker identity into independent control signals. Previous systems like VITS or Bark entangle these factors, making it hard to change one without affecting the others. IndexTTS2's disentangled design lets you keep a speaker's exact timbre while changing their emotional tone from neutral to excited, or slow down speech for emphasis without altering pitch.
IndexTTS2 supports two modes of duration control: explicit token-level duration specification, for exact timing of individual segments, and global ratio scaling, which stretches or compresses an utterance by a factor between 0.75× and 1.25×.
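A back-of-the-envelope sketch of how the two modes translate into speech-token budgets. The 25 tokens-per-second rate is an illustrative assumption, not a documented constant of the model:

```python
TOKENS_PER_SECOND = 25  # illustrative token rate; the real rate is model-specific

def tokens_for_duration(seconds):
    # Mode 1: token-level control -- convert an exact target duration
    # into a discrete speech-token budget.
    return round(seconds * TOKENS_PER_SECOND)

def scaled_token_budget(natural_tokens, ratio):
    # Mode 2: global ratio scaling -- stretch or compress the naturally
    # predicted length by a factor in the supported 0.75x-1.25x range.
    if not 0.75 <= ratio <= 1.25:
        raise ValueError("ratio outside supported 0.75x-1.25x range")
    return round(natural_tokens * ratio)

print(tokens_for_duration(2.0))       # 2 s target -> 50 tokens at 25 tokens/s
print(scaled_token_budget(100, 0.8))  # 100 tokens compressed to 80
```

For dubbing, mode 1 is the natural fit: the target duration comes from the video timeline, and the synthesizer is asked to fill exactly that window.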
Film and video dubbing is the primary use case—aligning generated speech to lip movements requires sub-100ms timing accuracy, which IndexTTS2 provides. Audiobook producers use it to generate consistent narrator voices across chapters without re-recording. Game studios add character dialogue by cloning a single reference performance and varying emotion per line. Accessibility teams generate audio descriptions for visually impaired users.
IndexTTS2 is available on GitHub and Hugging Face. Installation requires Python 3.10+ and a CUDA-capable GPU. The minimal setup is:
pip install indextts2

The Gradio demo on Hugging Face Spaces lets you try it without any local setup. For production deployment, a Docker container with the NVIDIA runtime is the recommended approach.
On the LibriSpeech test set, IndexTTS2 achieves a word error rate (WER) of 2.1% and speaker similarity (SECS) of 0.87. On emotion recognition benchmarks, it scores 78% accuracy for 7-class emotion classification—competitive with ground-truth human recordings at 82%. Duration control accuracy is within ±30ms of the specified target across utterances.
English and Mandarin Chinese. The model handles code-switching between both languages within a single utterance.
Yes. IndexTTS2 is released under the Apache 2.0 license, which permits commercial use without royalties.
You can specify duration at the token level or set a global ratio (0.75×–1.25×). This enables precise lip-sync for dubbing scenarios.
A GPU with at least 8 GB VRAM (e.g., RTX 3060) is recommended for real-time inference. CPU inference is possible but significantly slower.
IndexTTS2 disentangles emotion from speaker identity. You can clone a speaker's voice and independently control emotion (happy, sad, angry) without changing the voice timbre.
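The disentangled conditioning can be pictured as two independent vectors fed to the decoder. The embeddings and the `synthesize` function below are toy placeholders for illustration, not the model's actual representations or API:

```python
def synthesize(speaker_embedding, emotion_embedding, text):
    # Stand-in for the decoder: conditioning is a pair of independent
    # vectors, so either one can be swapped without touching the other.
    return {"speaker": tuple(speaker_embedding),
            "emotion": tuple(emotion_embedding),
            "text": text}

alice = [0.12, -0.53, 0.88]  # timbre vector extracted from a reference clip
happy = [1.0, 0.0, 0.0]      # emotion vectors (one-hot here for clarity)
sad   = [0.0, 1.0, 0.0]

a_happy = synthesize(alice, happy, "Hello!")
a_sad   = synthesize(alice, sad, "Hello!")

# Same timbre, different emotion: only the emotion component changed.
print(a_happy["speaker"] == a_sad["speaker"])  # True
print(a_happy["emotion"] == a_sad["emotion"])  # False
```

In an entangled system, changing the emotion input would also perturb the speaker component; the separation above is what lets one cloned voice be reused across every emotional register.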
IndexTTS2 is built for creators and developers who need emotionally expressive, timing-controlled speech synthesis. The autoregressive architecture delivers natural prosody that frame-level models cannot match, while the disentangled design gives precise control over duration, emotion, and speaker identity.
The project is open-source under Apache 2.0. Model weights, code, and documentation are available on GitHub and Hugging Face.