Model Information
Overview
This project uses Alibaba’s Qwen3-TTS series, specifically the 0.6B-parameter variants. Two model variants are provided: one for voice cloning and one for preset voices.
Model Variants
1. Qwen3-TTS-12Hz-0.6B-Base
- Purpose: Voice cloning from reference audio
- Parameters: 600 million
- Sample Rate: 12kHz (upsampled to 24kHz in output)
- Capabilities:
  - Voice cloning with as little as 3 seconds of reference audio
  - Multi-lingual support (English, Chinese, etc.)
  - Controllable via reference audio characteristics
- Folder: Qwen3-TTS-12Hz-0.6B-Base/
- Use When: You have a specific voice you want to replicate
2. Qwen3-TTS-12Hz-0.6B-CustomVoice
- Purpose: Preset high-quality voices with style control
- Parameters: 600 million
- Sample Rate: 12kHz (upsampled to 24kHz in output)
- Capabilities:
  - 9 built-in professional voices
  - Style control via text instructions
  - Consistent voice quality with no reference audio needed
  - Faster inference than cloning (no reference processing)
- Folder: Qwen3-TTS-12Hz-0.6B-CustomVoice/
- Use When: You want reliable, high-quality voices without managing reference files
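The "Use When" guidance for the two variants comes down to one question: is there a reference recording to clone? The sketch below encodes that choice. It assumes both model folders sit under a common directory; `pick_model_folder` and `models_root` are hypothetical names used for illustration, not part of the project's scripts.

```python
from pathlib import Path
from typing import Optional

BASE_FOLDER = "Qwen3-TTS-12Hz-0.6B-Base"
CUSTOM_VOICE_FOLDER = "Qwen3-TTS-12Hz-0.6B-CustomVoice"


def pick_model_folder(models_root: str, reference_audio: Optional[str] = None) -> Path:
    """Return the Base folder when cloning, otherwise the CustomVoice folder.

    `models_root` is an assumed parent directory holding both model folders.
    """
    folder = BASE_FOLDER if reference_audio else CUSTOM_VOICE_FOLDER
    return Path(models_root) / folder
```

Calling `pick_model_folder("models", "my_voice.wav")` would select the Base model, while `pick_model_folder("models")` would select CustomVoice.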
Architecture Details
Both models share the same underlying architecture:
- Transformer-based encoder-decoder design
- Text tokenizer optimized for multi-lingual TTS
- Audio decoder producing discrete audio tokens at 12kHz
- Post-processing upsamples to 24kHz for final WAV output
Key Technical Specs
| Specification | Value |
|---|---|
| Model Type | Encoder-Decoder Transformer |
| Text Vocabulary Size | ~150k tokens (BPE) |
| Audio Codec | Neural audio tokens (12kHz equiv.) |
| Upsampling | 12kHz → 24kHz linear interpolation |
| Context Length | Limited by chunking strategy (see pipeline) |
| Attention Mechanism | Supports SDPA, Eager (Flash Attention not required) |
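To make the "linear interpolation" entry in the table concrete, here is a minimal pure-Python sketch of 2x upsampling, where each inserted sample is the midpoint of its two neighbours. It only illustrates the technique named in the table; the function name `upsample_2x_linear` is hypothetical, and the project's actual post-processing code may be implemented differently.

```python
from typing import List, Sequence


def upsample_2x_linear(samples: Sequence[float]) -> List[float]:
    """Double the sample rate by inserting the midpoint between neighbours."""
    if not samples:
        return []
    out: List[float] = []
    for current, nxt in zip(samples, samples[1:]):
        out.append(current)
        out.append((current + nxt) / 2.0)
    # The last sample has no right-hand neighbour, so its value is held.
    out.append(samples[-1])
    out.append(samples[-1])
    return out
```

The output always has exactly twice as many samples as the input, which matches a 12kHz to 24kHz conversion.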
Hugging Face References
- Base Model: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base
- CustomVoice Model: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice
Model Files
Each model folder contains:
- config.json - Model configuration and architecture details
- generation_config.json - Default generation parameters
- preprocessor_config.json - Text preprocessing settings
- tokenizer_config.json + vocab.json + merges.txt - BPE tokenizer
- model.safetensors - Model weights in SafeTensors format
- processor.py - Hugging Face processor wrapper
- configuration.py - Model configuration class
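A quick way to confirm that a download is complete is to check a model folder against the file list above. This is a sketch only: `missing_model_files` is a hypothetical helper, and it assumes all listed files sit directly in the model folder rather than in subfolders.

```python
from pathlib import Path
from typing import List

# File names taken from the list above.
EXPECTED_FILES = (
    "config.json",
    "generation_config.json",
    "preprocessor_config.json",
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
    "model.safetensors",
    "processor.py",
    "configuration.py",
)


def missing_model_files(model_dir: str) -> List[str]:
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means every expected file was found; otherwise the returned names show what to re-download.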
Why These Models?
- Size Efficiency: 600M parameters fit comfortably in 6GB VRAM with batching
- Quality: State-of-the-art naturalness for open-source TTS
- Flexibility: Supports both cloning and preset voice approaches
- Local First: All processing happens offline, ensuring privacy
- Community Support: Active Hugging Face community and documentation
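The size-efficiency point can be checked with simple arithmetic: weight memory is the parameter count times the bytes per parameter. The sketch below covers the weights only; activations, caches, and framework overhead are excluded, so real VRAM use will be higher. The helper name `weight_memory_gib` is hypothetical.

```python
# Bytes used to store one parameter in each data type.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Memory for the weights alone, in GiB (activations and caches excluded)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in ("bfloat16", "float32"):
    print(f"{dtype}: {weight_memory_gib(600_000_000, dtype):.2f} GiB")
```

For 600 million parameters this works out to roughly 1.1 GiB in bfloat16 or float16 and roughly 2.2 GiB in float32, which leaves headroom within a 6GB budget.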
Loading Notes
The custom script implements robust model loading with fallback attention implementations:
- Attempts sdpa (Scaled Dot Product Attention) first - fastest on Windows
- Falls back to eager if SDPA fails or is unavailable
- Uses low_cpu_mem_usage=True to reduce RAM pressure during loading
- Supports multiple data types (bfloat16, float16, float32) for hardware compatibility
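The fallback order described above can be sketched as a small wrapper. This is not the project's script: `load_with_attention_fallback` is a hypothetical name, and `load_fn` stands in for whatever `from_pretrained`-style loader the script actually calls, which is not specified here. The sketch assumes that loader accepts `attn_implementation` and `low_cpu_mem_usage` keyword arguments, as Hugging Face Transformers loaders do.

```python
from typing import Any, Callable, Optional, Sequence


def load_with_attention_fallback(
    load_fn: Callable[..., Any],
    model_dir: str,
    attn_order: Sequence[str] = ("sdpa", "eager"),
    **load_kwargs: Any,
) -> Any:
    """Try each attention implementation in order; return the first that loads."""
    # Default to low_cpu_mem_usage=True, but let the caller override it.
    kwargs = {"low_cpu_mem_usage": True, **load_kwargs}
    last_error: Optional[Exception] = None
    for attn in attn_order:
        try:
            return load_fn(model_dir, attn_implementation=attn, **kwargs)
        except (RuntimeError, ValueError, ImportError) as exc:
            last_error = exc  # remember why this attempt failed, then try the next
    raise RuntimeError(
        f"no attention implementation worked: {list(attn_order)}"
    ) from last_error
```

Passing the loader in as an argument keeps the fallback logic independent of any one model class, and extra options such as the data type can be forwarded through `load_kwargs`.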
Voice Characteristics
CustomVoice Speakers
Each preset voice has distinct characteristics:
- Serena: Female, warm, neutral accent
- Vivian: Female, energetic, clear articulation
- Uncle_Fu: Male, mature, calm tone
- Ryan: Male, youthful, conversational
- Aiden: Male, friendly, medium pace
- Ono_Anna: Female, soft, gentle tone
- Sohee: Female, crisp, professional
- Eric: Male, authoritative, clear
- Dylan: Male, relaxed, natural flow
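When a speaker name comes from user input, validating it against the list above catches typos before inference starts. A minimal sketch follows; `resolve_speaker` is a hypothetical helper, and it is an assumption that the model accepts exactly these strings as speaker identifiers.

```python
# Speaker names copied from the list above.
SPEAKERS = (
    "Serena", "Vivian", "Uncle_Fu", "Ryan", "Aiden",
    "Ono_Anna", "Sohee", "Eric", "Dylan",
)


def resolve_speaker(name: str) -> str:
    """Return the canonical speaker name, matching case-insensitively."""
    lookup = {speaker.lower(): speaker for speaker in SPEAKERS}
    key = name.strip().lower()
    if key not in lookup:
        raise ValueError(f"unknown speaker {name!r}; choose from {sorted(SPEAKERS)}")
    return lookup[key]
```

For example, `resolve_speaker(" uncle_fu ")` returns `"Uncle_Fu"`, while an unlisted name raises a `ValueError` that lists the valid choices.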
Cloning Quality Factors
Voice cloning quality depends on:
- Reference Audio Length: Minimum 3 seconds, optimal 10+ seconds
- Audio Clarity: Low background noise, clear speech
- Speaker Consistency: Same voice throughout reference
- Recording Quality: 16kHz+ sample rate recommended
- Text Coverage: Reference text should match audio exactly
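The measurable factors above (length and sample rate) can be checked before cloning. The sketch below uses Python's standard `wave` module, so it only handles uncompressed PCM WAV files; `check_reference_wav` is a hypothetical helper. Noise level, speaker consistency, and transcript accuracy still need to be checked by ear.

```python
import wave
from typing import List

MIN_SECONDS = 3.0           # minimum reference length stated above
RECOMMENDED_SECONDS = 10.0  # "optimal 10+ seconds"
RECOMMENDED_RATE_HZ = 16_000


def check_reference_wav(path: str) -> List[str]:
    """Return human-readable warnings; an empty list means no issues found."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate
    warnings: List[str] = []
    if seconds < MIN_SECONDS:
        warnings.append(f"too short: {seconds:.1f}s, minimum is {MIN_SECONDS:.0f}s")
    elif seconds < RECOMMENDED_SECONDS:
        warnings.append(f"usable but short: {seconds:.1f}s, 10s or more is better")
    if rate < RECOMMENDED_RATE_HZ:
        warnings.append(f"low sample rate: {rate} Hz, 16000 Hz or more recommended")
    return warnings
```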
Updates and Versions
Check the Hugging Face model cards for:
- Latest version information
- Known limitations
- Community feedback and examples
- Citation information for academic use
Integration with Other Tools
These models can be used with:
- Hugging Face Transformers library directly
- LangChain for LLM-TTS chains
- Custom Python applications via the provided scripts
- Web APIs (though this suite focuses on local/offline use)
For the most current information, always refer to the official model pages on Hugging Face. See 08-directory-structure for complete file organization.