Emotional Zero-Shot TTS with Duration Control

IndexTTS2 is an open-source autoregressive text-to-speech system with precise timing control and emotion–timbre disentanglement.

  • Zero-Shot: clone from a single sample
  • EN + ZH: bilingual support
  • Apache 2.0: commercial use OK

Core Capabilities

⏱️ Duration Control

Token-level timing or global ratio scaling (0.75×–1.25×). Perfect for dubbing, lip-sync, and audiovisual alignment.

😊 Emotion Disentanglement

Separate emotion from speaker identity. Clone a voice and apply happy, sad, or angry tones independently.

🎤 Zero-Shot Cloning

Clone any voice from a single 5–15 second audio sample. No fine-tuning or training data required.

Streaming & Batch

Real-time streaming for interactive apps. Batch mode for processing large scripts and audiobook chapters.

🔧 Diffusers Integration

Load and run via Hugging Face Diffusers. Pre-built pipelines for common inference scenarios.

🌐 Code-Switching

Mix English and Chinese within a single utterance. Natural switching without artifacts or pauses.

Understanding IndexTTS2: Architecture & Usage

What Makes IndexTTS2 Different

Most text-to-speech systems treat speech as a sequence of acoustic features predicted frame by frame. IndexTTS2 takes a different approach: it uses an autoregressive language model to predict discrete speech tokens, then decodes them to waveform through a vocoder. This architecture—similar to how large language models generate text—enables emergent capabilities like emotional expression and natural prosody that frame-level models struggle with.
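The two-stage pipeline described above can be sketched in a few lines. This is a toy model, not the IndexTTS2 implementation: the codebook size, token rate, and "vocoder" are all made up for illustration, and the "language model" is just a deterministic stand-in.

```python
# Toy sketch of the pipeline above: an autoregressive model emits discrete
# speech tokens one at a time, then a separate decoder ("vocoder") turns
# the token sequence into audio samples. All names and numbers are
# illustrative, not the real IndexTTS2 API.
import math
import random

VOCAB = 64   # size of the discrete speech-token codebook (assumed)
EOS = 0      # token that ends generation

def next_token_logits(prefix: list[int]) -> list[float]:
    """Stand-in for the autoregressive LM: scores every candidate next token."""
    rng = random.Random(sum(prefix) + len(prefix))  # deterministic toy model
    return [rng.uniform(-1, 1) for _ in range(VOCAB)]

def generate_speech_tokens(max_len: int = 16) -> list[int]:
    tokens = [1]                                    # start token
    while len(tokens) < max_len:
        logits = next_token_logits(tokens)
        tok = max(range(VOCAB), key=logits.__getitem__)  # greedy decoding
        if tok == EOS:
            break
        tokens.append(tok)
    return tokens

def vocode(tokens: list[int], frame: int = 80) -> list[float]:
    """Stand-in vocoder: maps each token to `frame` audio samples."""
    audio: list[float] = []
    for t in tokens:
        freq = 100 + 5 * t                          # fake token-to-pitch map
        audio.extend(math.sin(2 * math.pi * freq * n / 16000) for n in range(frame))
    return audio

tokens = generate_speech_tokens()
audio = vocode(tokens)
```

The point of the structure, as the paragraph notes, is that generation happens in token space, where LM-style conditioning and sampling tricks apply, rather than frame by frame in acoustic-feature space.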

The key innovation is the separation of duration, emotion, and speaker identity into independent control signals. Previous systems like VITS or Bark entangle these factors, making it hard to change one without affecting the others. IndexTTS2's disentangled design lets you keep a speaker's exact timbre while changing their emotional tone from neutral to excited, or slow down speech for emphasis without altering pitch.
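The disentanglement idea can be made concrete with a small sketch: speaker timbre, emotion, and duration enter as independent conditioning signals, so changing one leaves the others untouched. The dataclass and field names below are hypothetical, not the actual IndexTTS2 interface.

```python
# Hedged sketch of disentangled conditioning: three independent control
# signals. Field names are illustrative, not IndexTTS2's real API.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Conditioning:
    speaker_embedding: tuple   # timbre, extracted from the reference clip
    emotion: str               # e.g. "neutral", "happy", "sad", "angry"
    duration_ratio: float      # global speed scaling, 0.75-1.25

base = Conditioning(speaker_embedding=(0.12, -0.43, 0.88),
                    emotion="neutral",
                    duration_ratio=1.0)

# Swap emotion only: the speaker embedding (timbre) is untouched.
excited = replace(base, emotion="happy")

# Slow down for emphasis without changing voice or emotion.
slow = replace(base, duration_ratio=0.8)
```

In an entangled system these three fields would effectively be one blob, and editing "emotion" would drag timbre and timing along with it.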

Duration Control: How It Works

IndexTTS2 supports two modes of duration control: a token-level mode, where you specify the timing (and hence the number of speech tokens) directly, and a global-ratio mode, which uniformly scales the speaking rate by a factor between 0.75× and 1.25×.
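A small sketch of how these two modes might map onto a speech-token budget. The 25 Hz token rate and the clamping behavior are assumptions for illustration, not IndexTTS2's documented values.

```python
# Duration control sketch: token-level mode fixes a token budget from a
# target duration; global-ratio mode scales natural duration within
# 0.75x-1.25x. The 25 Hz token rate is an assumption, not a documented value.
TOKEN_RATE_HZ = 25  # speech tokens per second (assumed)

def tokens_for_duration(target_seconds: float) -> int:
    """Token-level mode: a target duration fixes the token budget."""
    return round(target_seconds * TOKEN_RATE_HZ)

def scaled_duration(natural_seconds: float, ratio: float) -> float:
    """Global-ratio mode: scale natural duration, clamped to 0.75x-1.25x."""
    ratio = min(max(ratio, 0.75), 1.25)
    return natural_seconds * ratio

print(tokens_for_duration(2.0))   # 50 tokens for a 2 s line
print(scaled_duration(4.0, 1.5))  # ratio clamped to 1.25x -> 5.0 s
```

Token-level mode is what dubbing workflows use: the target line length comes from the video timeline, and the model must hit it exactly rather than merely approximate it.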

Practical Applications

Film and video dubbing is the primary use case—aligning generated speech to lip movements requires sub-100ms timing accuracy, which IndexTTS2 provides. Audiobook producers use it to generate consistent narrator voices across chapters without re-recording. Game studios add character dialogue by cloning a single reference performance and varying emotion per line. Accessibility teams generate audio descriptions for visually impaired users.

Getting Started

IndexTTS2 is available on GitHub and Hugging Face. Installation requires Python 3.10+ and a CUDA-capable GPU. The minimal setup is:
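A sketch of a typical install, assuming the repository URL and file layout below (verify both against the project's README, as they are not confirmed here):

```shell
# Assumed repository URL and requirements layout -- check the actual README.
git clone https://github.com/index-tts/index-tts.git
cd index-tts
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt   # PyTorch with CUDA support is required
```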

The Gradio demo on Hugging Face Spaces lets you try it without any local setup. For production deployment, a Docker container with NVIDIA runtime is the recommended approach.

Evaluation & Benchmarks

On the LibriSpeech test set, IndexTTS2 achieves a word error rate (WER) of 2.1% and speaker similarity (SECS) of 0.87. On emotion recognition benchmarks, it scores 78% accuracy for 7-class emotion classification—competitive with ground-truth human recordings at 82%. Duration control accuracy is within ±30ms of the specified target across utterances.
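Word error rate, the metric quoted above, is the word-level edit distance between a reference transcript and the recognized hypothesis, divided by the reference length. A minimal implementation for sanity-checking reported numbers:

```python
# WER = (substitutions + insertions + deletions) / reference word count,
# computed as Levenshtein distance over words via dynamic programming.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))           # row for the empty reference
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,             # deletion
                      d[j - 1] + 1,         # insertion
                      prev + (r != h))      # substitution (0 if words match)
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)

# One substitution ("sit") and one deletion ("the"): 2 / 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

The quoted 2.1% means roughly one word error per 48 reference words when an ASR system transcribes the synthesized audio.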

Who Uses IndexTTS2

  • Dubbing studios — lip-synced voice replacement for multi-language content
  • Audiobook producers — consistent narration across long-form content
  • Game developers — character voices with emotional variation per dialogue line
  • App builders — voice interfaces and IVR systems with natural prosody
  • Accessibility teams — generating audio descriptions for visually impaired users

Frequently Asked Questions

What languages does IndexTTS2 support?

English and Mandarin Chinese. The model handles code-switching between both languages within a single utterance.

Can I use IndexTTS2 commercially?

Yes. IndexTTS2 is released under the Apache 2.0 license, which permits commercial use without royalties.

How does duration control work?

You can specify duration at the token level or set a global ratio (0.75×–1.25×). This enables precise lip-sync for dubbing scenarios.

What hardware is needed to run IndexTTS2?

A GPU with at least 8 GB VRAM (e.g., RTX 3060) is recommended for real-time inference. CPU inference is possible but significantly slower.

How does emotion control differ from style transfer?

IndexTTS2 disentangles emotion from speaker identity. You can clone a speaker's voice and independently control emotion (happy, sad, angry) without changing the voice timbre.

About IndexTTS2

IndexTTS2 is built for creators and developers who need emotionally expressive, timing-controlled speech synthesis. The autoregressive architecture delivers natural prosody that frame-level models cannot match, while the disentangled design gives precise control over duration, emotion, and speaker identity.

The project is open-source under Apache 2.0. Model weights, code, and documentation are available on GitHub and Hugging Face.