IndexTTS2 is an open-source autoregressive text-to-speech system with precise timing control and emotion–timbre disentanglement.
Specify token-level timing or apply global ratio scaling (0.75×–1.25×). Perfect for dubbing, lip-sync, and audiovisual alignment.
Separate emotion from speaker identity. Clone a voice and apply happy, sad, or angry tones independently.
Clone any voice from a single 5–15 second audio sample. No fine-tuning or training data required.
Real-time streaming for interactive apps. Batch mode for processing large scripts and audiobook chapters.
Load and run with model weights hosted on Hugging Face. Pre-built pipelines for common inference scenarios.
Mix English and Chinese within a single utterance. Natural switching without artifacts or pauses.
Most text-to-speech systems treat speech as a sequence of acoustic features predicted frame by frame. IndexTTS2 takes a different approach: it uses an autoregressive language model to predict discrete speech tokens, then decodes them to waveform through a vocoder. This architecture—similar to how large language models generate text—enables emergent capabilities like emotional expression and natural prosody that frame-level models struggle with.
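The two-stage pipeline described above can be sketched in miniature. The "language model" and "vocoder" below are toy stand-ins written for illustration, not the actual IndexTTS2 components, and the token rate and sample rate are arbitrary:

```python
import math

def toy_lm_next_token(context):
    # Toy "language model": deterministically derives the next discrete
    # speech token from the running context (a stand-in for sampling
    # from an autoregressive transformer).
    return (sum(context) * 31 + len(context)) % 1024

def generate_speech_tokens(prompt_tokens, n_new):
    # Autoregressive loop: each new token is conditioned on all previous ones.
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        tokens.append(toy_lm_next_token(tokens))
    return tokens

def toy_vocoder(tokens, samples_per_token=240):
    # Toy "vocoder": maps each discrete token to a short sine burst
    # (a real neural vocoder predicts waveform samples instead).
    wave = []
    for t in tokens:
        freq = 80.0 + (t % 256)  # pseudo pitch derived from the token id
        for n in range(samples_per_token):
            wave.append(math.sin(2 * math.pi * freq * n / 24000))
    return wave

tokens = generate_speech_tokens([1, 2, 3], n_new=10)
audio = toy_vocoder(tokens)
print(len(tokens), len(audio))  # 13 tokens -> 13 * 240 samples
```

The point of the sketch is the division of labor: the autoregressive stage decides *what* discrete units to emit and in what rhythm, while the vocoder stage only renders those units as audio, which is what makes token-level timing control possible.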
The key innovation is the separation of duration, emotion, and speaker identity into independent control signals. Previous systems like VITS or Bark entangle these factors, making it hard to change one without affecting the others. IndexTTS2's disentangled design lets you keep a speaker's exact timbre while changing their emotional tone from neutral to excited, or slow down speech for emphasis without altering pitch.
IndexTTS2 supports two modes of duration control: explicit token-level duration specification, for exact timing of individual segments, and global ratio scaling, which stretches or compresses an utterance by a factor between 0.75× and 1.25×.
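A back-of-the-envelope sketch of how the two modes translate into speech-token budgets. The 25 tokens-per-second rate is an illustrative assumption, not a documented constant of the model:

```python
TOKENS_PER_SECOND = 25  # illustrative token rate; the real rate is model-specific

def tokens_for_duration(seconds):
    # Mode 1: token-level control -- convert an exact target duration
    # into a discrete speech-token budget.
    return round(seconds * TOKENS_PER_SECOND)

def scaled_token_budget(natural_tokens, ratio):
    # Mode 2: global ratio scaling -- stretch or compress the naturally
    # predicted length by a factor in the supported 0.75x-1.25x range.
    if not 0.75 <= ratio <= 1.25:
        raise ValueError("ratio outside supported 0.75x-1.25x range")
    return round(natural_tokens * ratio)

print(tokens_for_duration(2.0))       # 2 s target -> 50 tokens at 25 tokens/s
print(scaled_token_budget(100, 0.8))  # 100 tokens compressed to 80
```

For dubbing, mode 1 is the natural fit: the target duration comes from the video timeline, and the synthesizer is asked to fill exactly that window.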
Film and video dubbing is the primary use case—aligning generated speech to lip movements requires sub-100ms timing accuracy, which IndexTTS2 provides. Audiobook producers use it to generate consistent narrator voices across chapters without re-recording. Game studios add character dialogue by cloning a single reference performance and varying emotion per line. Accessibility teams generate audio descriptions for visually impaired users.
IndexTTS2 is available on GitHub and Hugging Face. Installation requires Python 3.10+ and a CUDA-capable GPU. The minimal setup is:
pip install indextts2

The Gradio demo on Hugging Face Spaces lets you try it without any local setup. For production deployment, a Docker container with the NVIDIA runtime is the recommended approach.
On the LibriSpeech test set, IndexTTS2 achieves a word error rate (WER) of 2.1% and speaker similarity (SECS) of 0.87. On emotion recognition benchmarks, it scores 78% accuracy for 7-class emotion classification—competitive with ground-truth human recordings at 82%. Duration control accuracy is within ±30ms of the specified target across utterances.
English and Mandarin Chinese. The model handles code-switching between both languages within a single utterance.
Yes. IndexTTS2 is released under the Apache 2.0 license, which permits commercial use without royalties.
You can specify duration at the token level or set a global ratio (0.75×–1.25×). This enables precise lip-sync for dubbing scenarios.
A GPU with at least 8 GB VRAM (e.g., RTX 3060) is recommended for real-time inference. CPU inference is possible but significantly slower.
IndexTTS2 disentangles emotion from speaker identity. You can clone a speaker's voice and independently control emotion (happy, sad, angry) without changing the voice timbre.
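The disentangled conditioning can be pictured as two independent vectors fed to the decoder. The embeddings and the `synthesize` function below are toy placeholders for illustration, not the model's actual representations or API:

```python
def synthesize(speaker_embedding, emotion_embedding, text):
    # Stand-in for the decoder: conditioning is a pair of independent
    # vectors, so either one can be swapped without touching the other.
    return {"speaker": tuple(speaker_embedding),
            "emotion": tuple(emotion_embedding),
            "text": text}

alice = [0.12, -0.53, 0.88]  # timbre vector extracted from a reference clip
happy = [1.0, 0.0, 0.0]      # emotion vectors (one-hot here for clarity)
sad   = [0.0, 1.0, 0.0]

a_happy = synthesize(alice, happy, "Hello!")
a_sad   = synthesize(alice, sad, "Hello!")

# Same timbre, different emotion: only the emotion component changed.
print(a_happy["speaker"] == a_sad["speaker"])  # True
print(a_happy["emotion"] == a_sad["emotion"])  # False
```

In an entangled system, changing the emotion input would also perturb the speaker component; the separation above is what lets one cloned voice be reused across every emotional register.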
IndexTTS2 is built for creators and developers who need emotionally expressive, timing-controlled speech synthesis. The autoregressive architecture delivers natural prosody that frame-level models cannot match, while the disentangled design gives precise control over duration, emotion, and speaker identity.
The project is open-source under Apache 2.0. Model weights, code, and documentation are available on GitHub and Hugging Face.