What Is Seed Control in TTS?
In text-to-speech systems, a "seed" is an integer that initializes the random number generator driving the generation process. Given the same text, model, and seed value, the system produces identical audio every time. This is called deterministic generation, and it is critical for production workflows where consistency matters. For example, when you generate voiceovers for a video series, the narrator's tone must match across episodes.
IndexTTS2 exposes fine-grained seed control through its API, allowing you to lock down not just the voice characteristics but also prosody patterns, breathing pauses, and emotional inflection.
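The idea of deterministic generation can be illustrated without any TTS model. The sketch below uses Python's standard `random` module as a stand-in for the model's internal sampling; `sample_pauses` is a hypothetical helper invented for this illustration and is not part of IndexTTS2.

```python
import random


def sample_pauses(seed: int, count: int = 5) -> list[float]:
    """Draw `count` pseudo-random pause lengths (seconds) from a seeded generator."""
    rng = random.Random(seed)  # private generator, global random state is untouched
    return [round(rng.uniform(0.1, 0.5), 3) for _ in range(count)]


first = sample_pauses(seed=42)
second = sample_pauses(seed=42)
other = sample_pauses(seed=43)

print(first == second)  # same seed, same sequence of "random" decisions
print(first == other)   # a different seed leads to a different sequence
```

A real model makes far more random decisions per utterance, but the principle is the same: fixing the seed fixes every one of them.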
Basic Seed Usage
Setting a Fixed Seed
The simplest form of seed control sets a single integer that governs all random decisions during inference:
from index_tts2 import IndexTTS2

model = IndexTTS2.load("index-tts2-1.0")
audio = model.synthesize(
    text="Welcome to our channel. Today we explore seed control.",
    seed=42,             # Fixed seed for reproducibility
    speaker="narrator",  # Speaker embedding
)
audio.save("output_seed42.wav")
Running this code multiple times with the same model version produces byte-identical WAV files. Change the seed to 43 and you get slightly different prosody (different pauses, different emphasis on certain syllables) while the speaker identity remains the same.
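One way to check the byte-identical claim on your own setup is to compare file hashes. This is a minimal sketch using only the Python standard library; the helper names `file_sha256` and `outputs_match` are made up for this example.

```python
import hashlib
from pathlib import Path


def file_sha256(path: str) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def outputs_match(path_a: str, path_b: str) -> bool:
    """True when both files are byte-identical."""
    return file_sha256(path_a) == file_sha256(path_b)
```

To use it, save two runs of the same synthesis call under different file names and pass both paths to `outputs_match`. A `False` result with an unchanged seed suggests that something else differs between the runs, such as the model version.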
Seed Exploration
Finding the right seed for your use case is often an exploration process. Generate multiple versions with different seeds and pick the one that best matches your desired style:
import os

os.makedirs("samples", exist_ok=True)  # make sure the output directory exists
for seed in range(100):
    audio = model.synthesize(text="Hello world.", seed=seed, speaker="narrator")
    audio.save(f"samples/hello_seed{seed}.wav")
# Listen to all 100 samples and pick the best one
Parameter Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| seed | int | -1 (random) | Random seed. Set to any non-negative integer for reproducible output. |
| temperature | float | 0.7 | Controls randomness. Lower = more monotone, higher = more expressive. |
| top_p | float | 0.9 | Nucleus sampling threshold. Controls diversity of prosody choices. |
| speed | float | 1.0 | Speaking speed multiplier. 0.5 = half speed, 2.0 = double speed. |
| speaker | str | "default" | Speaker embedding name or reference audio path. |
| emotion | str | None | Emotion tag: "happy", "sad", "angry", "neutral". |
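If you pass these parameters around in your own code, it can help to collect them in one place and validate them before calling the model. The dataclass below is a hypothetical container that mirrors the table's names and defaults; it is not part of IndexTTS2, and the bounds checked for `temperature`, `top_p`, and `speed` are assumptions based on how these parameters usually behave, not documented limits.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_EMOTIONS = {"happy", "sad", "angry", "neutral"}  # tags listed in the table


@dataclass
class SynthesisParams:
    """Hypothetical settings container mirroring the parameter table."""
    seed: int = -1
    temperature: float = 0.7
    top_p: float = 0.9
    speed: float = 1.0
    speaker: str = "default"
    emotion: Optional[str] = None

    def validate(self) -> None:
        """Raise ValueError when a value falls outside the assumed ranges."""
        if self.seed < -1:
            raise ValueError("seed must be -1 (random) or a non-negative integer")
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")  # assumed bound
        if not 0 < self.top_p <= 1:
            raise ValueError("top_p must be in (0, 1]")  # assumed bound
        if self.speed <= 0:
            raise ValueError("speed must be positive")  # assumed bound
        if self.emotion is not None and self.emotion not in ALLOWED_EMOTIONS:
            raise ValueError(f"unknown emotion: {self.emotion!r}")
```

After `validate()` passes, `dataclasses.asdict(params)` gives a plain dict that could be unpacked into a synthesis call.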
Voice Cloning with Reference Audio
Instead of using a named speaker, you can provide a reference audio file. IndexTTS2 extracts speaker embeddings from the reference and synthesizes new text in that voice:
audio = model.synthesize(
    text="This is my cloned voice speaking new text.",
    speaker="path/to/reference.wav",  # 5-30 seconds of clean speech
    seed=42,
    temperature=0.65,
)
For best results, use reference audio that is 10–20 seconds long, recorded in a quiet environment, with consistent volume. The model handles background noise reasonably well, but cleaner input produces more accurate clones.
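Before cloning, you can check the length of a reference clip programmatically. The sketch below uses Python's standard `wave` module and assumes an uncompressed PCM WAV file. The helper names are hypothetical, and the default 5 to 30 second window is taken from the comment in the example above.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference(path: str, min_s: float = 5.0, max_s: float = 30.0) -> bool:
    """True when the clip length falls inside the accepted window."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

For the recommended 10 to 20 second range, call `check_reference(path, min_s=10.0, max_s=20.0)`. This checks duration only; noise level and volume consistency still need a listen.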
Batch Generation for Production
For audiobook or podcast production, you often need to generate hundreds of segments with a consistent voice. Here's a batch pipeline that reuses the model loaded earlier and locks one seed for the whole project:
import json
import os

with open("script.json") as f:
    segments = json.load(f)  # [{"id": 1, "text": "...", "emotion": "neutral"}, ...]

SEED = 42  # Lock seed for entire project
os.makedirs("output", exist_ok=True)  # make sure the output directory exists

for seg in segments:
    audio = model.synthesize(
        text=seg["text"],
        seed=SEED,
        speaker="narrator_v2",
        emotion=seg.get("emotion", "neutral"),
        temperature=0.6,  # Lower temp for narration consistency
    )
    audio.save(f"output/{seg['id']:04d}.wav")
    print(f"Generated segment {seg['id']}")
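For long projects, two additions are often useful: skipping segments that were already rendered, and recording the settings used so the run can be reproduced later. The sketch below shows one possible shape for this. `run_batch` and its `synthesize` argument are hypothetical: `synthesize` stands for any callable you write that wraps the model and returns the audio as bytes, which is an assumption about how you adapt the API and not something IndexTTS2 provides.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def run_batch(
    segments: Iterable[dict],
    synthesize: Callable[..., bytes],  # hypothetical wrapper returning WAV bytes
    out_dir: str,
    seed: int = 42,
) -> list[dict]:
    """Render each segment once, skip existing files, and return a manifest."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    manifest = []
    for seg in segments:
        target = out / f"{seg['id']:04d}.wav"
        emotion = seg.get("emotion", "neutral")
        if not target.exists():  # resume support: do not re-render finished segments
            target.write_bytes(synthesize(text=seg["text"], seed=seed, emotion=emotion))
        manifest.append(
            {"id": seg["id"], "seed": seed, "emotion": emotion, "file": target.name}
        )
    # Store the settings next to the audio so the run can be reproduced later.
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Because finished files are skipped, an interrupted run can be restarted with the same arguments and will pick up where it stopped. Delete a segment's file to force it to be rendered again.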
Advanced: Temperature × Seed Interaction
Temperature and seed interact in important ways. At temperature=0.0, the seed is irrelevant because the model always picks the most likely token. As temperature increases, the seed's influence grows because more decisions are left to chance. For creative applications (poetry reading, dramatic narration), use temperature=0.8-1.0 with seed exploration. For technical narration (documentation, tutorials), use temperature=0.5-0.7 with a fixed seed.
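The interaction can be seen in a toy sampler. The sketch below implements generic temperature-scaled softmax sampling over three made-up scores; it illustrates the general technique and is not IndexTTS2's actual decoding code. `sample_index` and the `logits` values are invented for this example.

```python
import math
import random


def sample_index(logits: list[float], temperature: float, seed: int) -> int:
    """Pick an index from `logits` using temperature-scaled softmax sampling."""
    if temperature == 0.0:
        # Greedy decoding: no random draw happens, so the seed cannot matter.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return random.Random(seed).choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.5, 0.3]
greedy = {sample_index(logits, 0.0, seed) for seed in range(50)}
warm = {sample_index(logits, 1.0, seed) for seed in range(50)}
print(greedy)     # one index only, whatever the seed
print(len(warm))  # more than one index once temperature is above zero
```

At temperature 0.0 every seed selects the highest-scoring option. At 1.0 the choice varies from seed to seed, but any single seed still returns the same choice on every run.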
Troubleshooting
- Same seed, different output? Check that you're using the same model version. Model updates change weights, which changes output even with identical seeds.
- Cloned voice sounds robotic? Increase temperature to 0.75-0.85. Ensure reference audio has natural prosody, not monotone reading.
- Audio has clicks or pops? Usually a chunk boundary issue. Increase the chunk_overlap parameter or use the --crossfade flag.
- Speed sounds unnatural? Stay within the 0.8-1.3 range. Extreme values (below 0.5 or above 2.0) produce artifacts.
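To see why overlap and crossfading help with clicks at chunk boundaries, here is a generic linear crossfade over plain lists of samples. It is a sketch of the general technique only: how IndexTTS2 joins chunks internally is not shown in this article, and `crossfade` is a hypothetical helper.

```python
def crossfade(left: list[float], right: list[float], overlap: int) -> list[float]:
    """Join two sample lists, blending the last `overlap` samples of `left`
    with the first `overlap` samples of `right` using a linear ramp."""
    if overlap <= 0:
        return left + right  # hard cut: any level mismatch becomes a click
    if overlap > min(len(left), len(right)):
        raise ValueError("overlap is longer than one of the chunks")
    blended = []
    for i in range(overlap):
        fade_in = (i + 1) / (overlap + 1)  # weight of the incoming chunk
        tail_sample = left[len(left) - overlap + i]
        blended.append(tail_sample * (1 - fade_in) + right[i] * fade_in)
    return left[:len(left) - overlap] + blended + right[overlap:]
```

With a hard cut, the signal jumps from the last sample of one chunk to the first sample of the next. With an overlap, the jump is spread across several samples, which is why a larger overlap tends to remove audible pops.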