Model Information

Overview

This project uses Alibaba’s Qwen3-TTS series, specifically the 0.6B-parameter variants. Two models are provided: Base for cloning a voice from reference audio, and CustomVoice for preset voices with style control.

Model Variants

1. Qwen3-TTS-12Hz-0.6B-Base

Purpose: Voice cloning from reference audio
Parameters: 600 million
Codec Token Rate: 12Hz (the "12Hz" in the model name); output audio is 24kHz
Capabilities:
- Voice cloning with as little as 3 seconds of reference audio
- Multi-lingual support (English, Chinese, etc.)
- Controllable via reference audio characteristics
Folder: Qwen3-TTS-12Hz-0.6B-Base/
Use When: You have a specific voice you want to replicate

2. Qwen3-TTS-12Hz-0.6B-CustomVoice

Purpose: Preset high-quality voices with style control
Parameters: 600 million
Codec Token Rate: 12Hz (the "12Hz" in the model name); output audio is 24kHz
Capabilities:
- 9 built-in professional voices
- Style control via text instructions
- Consistent voice quality without reference audio needed
- Faster inference than cloning (no reference processing)
Folder: Qwen3-TTS-12Hz-0.6B-CustomVoice/
Use When: You want reliable, high-quality voices without managing reference files
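
The choice between the two variants can be sketched as a small helper. This is only an illustration: the function name pick_model_folder and the models_root parameter are hypothetical, and only the two folder names come from the listings above.

```python
from pathlib import Path
from typing import Optional

# Folder names as listed above; everything else here is illustrative.
BASE_FOLDER = "Qwen3-TTS-12Hz-0.6B-Base"
CUSTOM_VOICE_FOLDER = "Qwen3-TTS-12Hz-0.6B-CustomVoice"


def pick_model_folder(reference_audio: Optional[str], models_root: str = ".") -> Path:
    """Return the Base folder when a reference clip is given, else CustomVoice."""
    folder = BASE_FOLDER if reference_audio else CUSTOM_VOICE_FOLDER
    return Path(models_root) / folder
```

Passing a path such as "my_voice.wav" selects the cloning model; passing None selects the preset-voice model.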

Architecture Details

Both models share the same underlying architecture:

Transformer-based encoder-decoder design
Text tokenizer optimized for multi-lingual TTS
Audio decoder producing discrete audio tokens at a 12Hz frame rate
Codec decoding turns those tokens into a 24kHz waveform for final WAV output

Key Technical Specs

Specification	Value
Model Type	Encoder-Decoder Transformer
Text Vocabulary Size	~150k tokens (BPE)
Audio Codec	Neural audio tokens (12Hz frame rate)
Output Sample Rate	24kHz (decoded from the codec tokens)
Context Length	Limited by chunking strategy (see pipeline)
Attention Mechanism	Supports SDPA, Eager (Flash Attention not required)

Hugging Face References

Base Model: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base
CustomVoice Model: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice
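
If the model folders are not yet present locally, one way to fetch them is the huggingface_hub package. The sketch below assumes that package is installed and that the target folder should be named after the repository; the helper name download_model and the models_root parameter are hypothetical. The repository IDs are taken from the URLs above.

```python
# Repository IDs derived from the Hugging Face URLs listed above.
REPO_IDS = {
    "base": "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
    "custom_voice": "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",
}


def download_model(variant: str, models_root: str = ".") -> str:
    """Download one variant into <models_root>/<repo name> and return the path."""
    # Imported lazily so the mapping above is usable without the package.
    from huggingface_hub import snapshot_download

    repo_id = REPO_IDS[variant]
    local_dir = f"{models_root}/{repo_id.split('/')[-1]}"
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

A one-time download needs network access; after that, all processing can stay offline.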

Model Files

Each model folder contains:

config.json - Model configuration and architecture details
generation_config.json - Default generation parameters
preprocessor_config.json - Text preprocessing settings
tokenizer_config.json + vocab.json + merges.txt - BPE tokenizer
model.safetensors - Model weights in SafeTensors format
processor.py - Hugging Face processor wrapper
configuration.py - Model configuration class
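
A quick way to confirm that a download is complete is to compare a folder against the file list above. This is a minimal sketch; the function name missing_model_files is hypothetical, and the check only looks at file names, not file contents.

```python
from pathlib import Path
from typing import List

# File names taken from the list above.
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "preprocessor_config.json",
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
    "model.safetensors",
    "processor.py",
    "configuration.py",
]


def missing_model_files(model_dir: str) -> List[str]:
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty result means every listed file is present; otherwise the returned names show what to re-download.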

Why These Models?

Size Efficiency: 600M parameters fit comfortably in 6GB VRAM with batching
Quality: Highly natural-sounding speech for an open-source TTS model
Flexibility: Supports both cloning and preset voice approaches
Local First: All processing happens offline, ensuring privacy
Community Support: Active Hugging Face community and documentation
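
The size claim can be checked with simple arithmetic: parameter count times bytes per parameter. The sketch below covers the weights only; activations, generation caches, and the audio codec add to the total, which is why the 6GB figure leaves headroom. The helper name weight_size_gb is hypothetical.

```python
PARAM_COUNT = 600_000_000  # 0.6B parameters, as stated above

# Storage cost per parameter for the data types mentioned in this document.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_size_gb(dtype: str, params: int = PARAM_COUNT) -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


for name in BYTES_PER_PARAM:
    print(f"{name}: ~{weight_size_gb(name):.1f} GB")
```

This gives roughly 2.4 GB for float32 and 1.2 GB for either 16-bit type.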

Loading Notes

The custom script loads the model with a fallback chain of attention implementations:

Attempts sdpa (Scaled Dot Product Attention) first, the fastest option on Windows when Flash Attention is not installed
Falls back to eager if SDPA fails or is unavailable
Uses low_cpu_mem_usage=True to reduce RAM pressure during loading
Supports multiple data types (bfloat16, float16, float32) for hardware compatibility
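
The fallback order described above can be sketched as follows. This is not the project's actual script: the function name load_with_attention_fallback is hypothetical, and because the loader class for these checkpoints is not stated here, the loader is passed in as load_fn. The attn_implementation and low_cpu_mem_usage keyword names follow the Hugging Face Transformers from_pretrained convention.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def load_with_attention_fallback(
    load_fn: Callable[..., Any],
    model_dir: str,
    attn_order: Sequence[str] = ("sdpa", "eager"),
    **load_kwargs: Any,
) -> Tuple[Any, str]:
    """Try each attention implementation in order; return (model, impl used)."""
    last_error: Optional[Exception] = None
    for impl in attn_order:
        try:
            model = load_fn(
                model_dir,
                attn_implementation=impl,
                low_cpu_mem_usage=True,  # reduce RAM pressure while loading
                **load_kwargs,
            )
            return model, impl
        except Exception as exc:  # broad on purpose: failure modes vary by backend
            last_error = exc
    raise RuntimeError(
        f"No attention implementation worked: {list(attn_order)}"
    ) from last_error
```

A data type can be forwarded through load_kwargs (for example a torch_dtype argument), so the same helper covers the bfloat16, float16, and float32 cases.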

Voice Characteristics

CustomVoice Speakers

Each preset voice has distinct characteristics:

Serena: Female, warm, neutral accent
Vivian: Female, energetic, clear articulation
Uncle_Fu: Male, mature, calm tone
Ryan: Male, youthful, conversational
Aiden: Male, friendly, medium pace
Ono_Anna: Female, soft, gentle tone
Sohee: Female, crisp, professional
Eric: Male, authoritative, clear
Dylan: Male, relaxed, natural flow
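
When a script accepts a speaker name from the user, it helps to validate it against the preset list before loading the model. The sketch below copies the names and descriptions from the list above; the helper name resolve_speaker and the case-insensitive matching are assumptions, not documented behavior of the model.

```python
# Speaker names and descriptions copied from the list above.
SPEAKERS = {
    "Serena": "Female, warm, neutral accent",
    "Vivian": "Female, energetic, clear articulation",
    "Uncle_Fu": "Male, mature, calm tone",
    "Ryan": "Male, youthful, conversational",
    "Aiden": "Male, friendly, medium pace",
    "Ono_Anna": "Female, soft, gentle tone",
    "Sohee": "Female, crisp, professional",
    "Eric": "Male, authoritative, clear",
    "Dylan": "Male, relaxed, natural flow",
}


def resolve_speaker(name: str) -> str:
    """Return the canonical speaker name, matching case-insensitively."""
    lookup = {key.lower(): key for key in SPEAKERS}
    try:
        return lookup[name.strip().lower()]
    except KeyError:
        raise ValueError(
            f"Unknown speaker {name!r}; choose from {sorted(SPEAKERS)}"
        ) from None
```

Failing early with the list of valid names is cheaper than discovering a typo after the model has loaded.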

Cloning Quality Factors

Voice cloning quality depends on:

Reference Audio Length: Minimum 3 seconds, optimal 10+ seconds
Audio Clarity: Low background noise, clear speech
Speaker Consistency: Same voice throughout reference
Recording Quality: 16kHz+ sample rate recommended
Text Coverage: The reference transcript should match the spoken audio exactly
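
The measurable factors above (length and sample rate) can be checked before cloning. This is a minimal sketch using Python's standard wave module, so it assumes an uncompressed PCM WAV file; the function name check_reference_wav is hypothetical. Noise level, speaker consistency, and transcript accuracy still need a manual check.

```python
import wave
from typing import List

MIN_SECONDS = 3.0         # minimum length stated above
GOOD_SECONDS = 10.0       # "optimal" length stated above
MIN_SAMPLE_RATE = 16_000  # recommended minimum sample rate stated above


def check_reference_wav(path: str) -> List[str]:
    """Return a list of warnings for a PCM WAV reference clip (empty if none)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate
    warnings = []
    if seconds < MIN_SECONDS:
        warnings.append(f"too short: {seconds:.1f}s < {MIN_SECONDS:.0f}s minimum")
    elif seconds < GOOD_SECONDS:
        warnings.append(f"usable but short: {seconds:.1f}s, {GOOD_SECONDS:.0f}s+ is better")
    if rate < MIN_SAMPLE_RATE:
        warnings.append(f"low sample rate: {rate} Hz < {MIN_SAMPLE_RATE} Hz")
    return warnings
```

An empty list means the clip meets the length and sample-rate guidance; any warnings can be shown to the user before generation starts.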

Updates and Versions

Check the Hugging Face model cards for:

Latest version information
Known limitations
Community feedback and examples
Citation information for academic use

Integration with Other Tools

These models can be used with:

Hugging Face Transformers library directly
LangChain for LLM-TTS chains
Custom Python applications via the provided scripts
Web APIs (though this suite focuses on local/offline use)

For the most current information, always refer to the official model pages on Hugging Face. See 08-directory-structure for complete file organization.
