IndexTTS2 Seed Control & Voice Cloning Tutorial

What Is Seed Control in TTS?

In text-to-speech systems, a "seed" is an integer that initializes the pseudorandom number generator used during generation. Given the same text, model, settings, and seed value, the system produces identical audio every time. This is called deterministic generation, and it is critical for production workflows where consistency matters. One example is generating voiceovers for a video series where the narrator's tone must match across episodes.

IndexTTS2 exposes fine-grained seed control through its API, allowing you to lock down not just the voice characteristics but also prosody patterns, breathing pauses, and emotional inflection.
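The principle behind seeding can be illustrated without a TTS model. The sketch below uses only Python's standard `random` module; the helper name `draw_decisions` is made up for this example and is not part of IndexTTS2. It stands in for the stream of random choices a model makes during inference.

```python
import random
from typing import List


def draw_decisions(seed: int, n: int = 5) -> List[float]:
    """Return n pseudorandom values from a generator initialized with `seed`."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return [rng.random() for _ in range(n)]


same_a = draw_decisions(42)
same_b = draw_decisions(42)
other = draw_decisions(43)

print(same_a == same_b)  # same seed: the two sequences match
print(same_a == other)   # different seed: the sequences differ
```

A seeded TTS model behaves the same way in spirit: the seed fixes the whole sequence of random draws, so every downstream choice repeats.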

Basic Seed Usage

Setting a Fixed Seed

The simplest form of seed control sets a single integer that governs all random decisions during inference:

from index_tts2 import IndexTTS2

model = IndexTTS2.load("index-tts2-1.0")
audio = model.synthesize(
    text="Welcome to our channel. Today we explore seed control.",
    seed=42,           # Fixed seed for reproducibility
    speaker="narrator" # Speaker embedding
)
audio.save("output_seed42.wav")

Running this code multiple times with the same model version produces byte-identical WAV files. Change the seed to 43 and the prosody shifts slightly (different pauses, different emphasis on certain syllables) while the speaker identity remains the same.
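One way to check reproducibility on your own setup is to synthesize the same text twice and compare file hashes. The helper below is a minimal sketch using only the standard library; the function names and the example file names in the usage note are assumptions, not part of IndexTTS2.

```python
import hashlib
from typing import List


def file_sha256(path: str) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def all_identical(paths: List[str]) -> bool:
    """True if every file in `paths` has the same SHA-256 digest."""
    return len({file_sha256(p) for p in paths}) <= 1
```

For example, after saving two runs as `run_a.wav` and `run_b.wav` (hypothetical names), `all_identical(["run_a.wav", "run_b.wav"])` should return `True` if generation is deterministic in your environment. If it returns `False`, the audio may still sound the same; a hash only detects exact byte equality.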

Seed Exploration

Finding the right seed for your use case is often an exploration process. Generate multiple versions with different seeds and pick the one that best matches your desired style:

import os

os.makedirs("samples", exist_ok=True)  # Create the output directory up front

for seed in range(100):
    audio = model.synthesize(text="Hello world.", seed=seed, speaker="narrator")
    audio.save(f"samples/hello_seed{seed}.wav")
# Listen to all 100 samples and pick the best one

Parameter Reference

Parameter	Type	Default	Description
`seed`	int	-1 (random)	Random seed. Set to any non-negative integer for reproducible output.
`temperature`	float	0.7	Controls randomness. Lower = more monotone, Higher = more expressive.
`top_p`	float	0.9	Nucleus sampling threshold. Controls diversity of prosody choices.
`speed`	float	1.0	Speaking speed multiplier. 0.5 = half speed, 2.0 = double speed.
`speaker`	str	"default"	Speaker embedding name or reference audio path.
`emotion`	str	None	Emotion tag: "happy", "sad", "angry", "neutral".
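When parameters come from a config file or user input, it can help to validate them before calling the model. The dataclass below is a hypothetical container that mirrors the table's names and defaults; it is not part of IndexTTS2. The numeric bounds for `temperature`, `top_p`, and `speed` are assumptions based on how these settings usually behave, not documented limits.

```python
from dataclasses import dataclass
from typing import List, Optional

# Emotion tags listed in the parameter table
ALLOWED_EMOTIONS = {"happy", "sad", "angry", "neutral"}


@dataclass
class SynthesisParams:
    """Hypothetical settings container mirroring the parameter table."""
    seed: int = -1
    temperature: float = 0.7
    top_p: float = 0.9
    speed: float = 1.0
    speaker: str = "default"
    emotion: Optional[str] = None

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the settings look sane."""
        problems = []
        if self.seed < -1:
            problems.append("seed must be -1 (random) or a non-negative integer")
        if self.temperature < 0:  # assumed lower bound
            problems.append("temperature must be >= 0")
        if not 0 < self.top_p <= 1:  # assumed range for nucleus sampling
            problems.append("top_p must be in (0, 1]")
        if self.speed <= 0:  # assumed: a multiplier must be positive
            problems.append("speed must be positive")
        if self.emotion is not None and self.emotion not in ALLOWED_EMOTIONS:
            problems.append(f"unknown emotion tag: {self.emotion!r}")
        return problems


print(SynthesisParams().validate())                          # defaults pass
print(SynthesisParams(seed=-5, emotion="bored").validate())  # two problems reported
```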

Voice Cloning with Reference Audio

Instead of using a named speaker, you can provide a reference audio file. IndexTTS2 extracts speaker embeddings from the reference and synthesizes new text in that voice:

audio = model.synthesize(
    text="This is my cloned voice speaking new text.",
    speaker="path/to/reference.wav",  # 5-30 seconds of clean speech
    seed=42,
    temperature=0.65
)

For best results, use reference audio that is 10–20 seconds long, recorded in a quiet environment, with consistent volume. The model handles background noise reasonably well, but cleaner input produces more accurate clones.
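The length guidance above can be checked before cloning. This sketch uses Python's standard `wave` module, which reads uncompressed PCM WAV files only; other formats would need a different reader. The function names and status strings are made up for illustration, and the thresholds are the 5-30 second and 10-20 second figures from this section.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def reference_length_status(path: str,
                            accepted=(5.0, 30.0),
                            recommended=(10.0, 20.0)) -> str:
    """Classify a reference clip by duration only (not by noise or volume)."""
    duration = wav_duration_seconds(path)
    if not accepted[0] <= duration <= accepted[1]:
        return "out_of_range"
    if recommended[0] <= duration <= recommended[1]:
        return "recommended"
    return "acceptable"
```

A call such as `reference_length_status("path/to/reference.wav")` returns one of the three strings. Duration is the only property checked here; recording quality still has to be judged by ear.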

Batch Generation for Production

For audiobook or podcast production, you often need to generate hundreds of segments with a consistent voice. Here's a basic batch pipeline:

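import json
import os

with open("script.json", encoding="utf-8") as f:
    segments = json.load(f)  # [{"id": 1, "text": "...", "emotion": "neutral"}, ...]

SEED = 42  # Lock seed for entire project
os.makedirs("output", exist_ok=True)  # Make sure the output directory exists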

for seg in segments:
    audio = model.synthesize(
        text=seg["text"],
        seed=SEED,
        speaker="narrator_v2",
        emotion=seg.get("emotion", "neutral"),
        temperature=0.6  # Lower temp for narration consistency
    )
    audio.save(f"output/{seg['id']:04d}.wav")
    print(f"Generated segment {seg['id']}")
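Because reproducibility depends on the model version, seed, and sampling settings together, it can be useful to store them next to the generated audio. The sketch below builds a simple manifest from the same segment structure as `script.json`. The function `build_manifest` and the manifest layout are assumptions made for this example; IndexTTS2 does not define them.

```python
import json


def build_manifest(segments, *, model_name, seed, speaker, temperature):
    """Describe how each segment is generated so a run can be repeated later."""
    return {
        "model": model_name,
        "seed": seed,
        "speaker": speaker,
        "temperature": temperature,
        "segments": [
            {
                "id": seg["id"],
                "file": f"{seg['id']:04d}.wav",  # same naming as the batch loop
                "emotion": seg.get("emotion", "neutral"),
            }
            for seg in segments
        ],
    }


example = build_manifest(
    [
        {"id": 1, "text": "Chapter one."},
        {"id": 2, "text": "It was late.", "emotion": "sad"},
    ],
    model_name="index-tts2-1.0",
    seed=42,
    speaker="narrator_v2",
    temperature=0.6,
)
print(json.dumps(example, indent=2))
```

Writing this dictionary to a file such as `output/manifest.json` gives you a record to consult when a segment has to be regenerated months later.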

Advanced: Temperature × Seed Interaction

Temperature and seed interact in important ways. At temperature=0.0, the seed is irrelevant because the model always picks the most likely token. As temperature increases, the seed's influence grows because more decisions are left to chance. For creative applications (poetry reading, dramatic narration), use temperature=0.8-1.0 with seed exploration. For technical narration (documentation, tutorials), use temperature=0.5-0.7 with a fixed seed.
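The interaction can be seen in a toy sampler. The sketch below is not IndexTTS2's decoder; it is a generic temperature-scaled softmax sampler over made-up scores, written with the standard library only. At temperature 0 it falls back to picking the highest score, so no random draw happens and the seed has nothing to influence.

```python
import math
import random


def sample_index(logits, temperature, rng):
    """Pick an index from `logits` using temperature-scaled softmax sampling."""
    if temperature == 0.0:
        # Greedy choice: no random draw, so the seed cannot matter.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def sample_sequence(seed, temperature, steps=20):
    """Draw `steps` choices with a generator initialized from `seed`."""
    rng = random.Random(seed)
    logits = [2.0, 1.0, 0.5, 0.1]  # toy scores, not real model output
    return [sample_index(logits, temperature, rng) for _ in range(steps)]


# Count how many distinct sequences ten different seeds produce.
greedy = {tuple(sample_sequence(s, 0.0)) for s in range(10)}
warm = {tuple(sample_sequence(s, 1.0)) for s in range(10)}
print(len(greedy))  # one sequence: every seed gives the same result
print(len(warm))    # several sequences: the seed now changes the outcome
```

Raising the temperature flattens the weights, so less likely options are drawn more often and different seeds diverge sooner.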

Troubleshooting

Same seed, different output? Check that you're using the same model version. Model updates change weights, which changes output even with identical seeds.
Cloned voice sounds robotic? Increase temperature to 0.75-0.85. Ensure reference audio has natural prosody, not monotone reading.
Audio has clicks or pops? Usually a chunk boundary issue. Increase the chunk_overlap parameter or use the --crossfade flag.
Speed sounds unnatural? Stay within the 0.8-1.3 range. Extreme values (below 0.5 or above 2.0) produce artifacts.
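To see why overlap and crossfading help with clicks at chunk boundaries, here is a minimal sketch of a linear crossfade between two chunks of audio samples. It illustrates the general technique only and makes no claim about how IndexTTS2 implements `chunk_overlap` or `--crossfade`; the function name and the plain-list representation of samples are assumptions.

```python
def crossfade(a, b, overlap):
    """Join two sample lists, blending the last `overlap` samples of `a`
    into the first `overlap` samples of `b` with a linear ramp."""
    if overlap <= 0:
        return list(a) + list(b)  # hard cut: this is where clicks come from
    if overlap > min(len(a), len(b)):
        raise ValueError("overlap is longer than one of the chunks")
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of `b` rises across the overlap
        blended.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    return list(a[:-overlap]) + blended + list(b[overlap:])


# A jump from 1.0 to 0.0 becomes a gradual ramp across the overlap.
print(crossfade([1.0] * 4, [0.0] * 4, overlap=2))
```

A hard cut between chunks that end and start at different amplitudes creates a sudden step in the waveform, which is heard as a click. Blending over a short overlap replaces that step with a ramp.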