Model Information

Overview

This project uses Alibaba’s Qwen3-TTS series, specifically the 0.6B-parameter variants. Two model types are provided: one for voice cloning from reference audio and one for preset voices with style control.

Model Variants

1. Qwen3-TTS-12Hz-0.6B-Base

  • Purpose: Voice cloning from reference audio
  • Parameters: 600 million
  • Sample Rate: 12kHz (upsampled to 24kHz in output)
  • Capabilities:
    • Voice cloning with as little as 3 seconds of reference audio
    • Multi-lingual support (English, Chinese, etc.)
    • Controllable via reference audio characteristics
  • Folder: Qwen3-TTS-12Hz-0.6B-Base/
  • Use When: You have a specific voice you want to replicate

2. Qwen3-TTS-12Hz-0.6B-CustomVoice

  • Purpose: Preset high-quality voices with style control
  • Parameters: 600 million
  • Sample Rate: 12kHz (upsampled to 24kHz in output)
  • Capabilities:
    • 9 built-in professional voices
    • Style control via text instructions
    • Consistent voice quality with no reference audio needed
    • Faster inference than cloning (no reference processing)
  • Folder: Qwen3-TTS-12Hz-0.6B-CustomVoice/
  • Use When: You want reliable, high-quality voices without managing reference files
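
The choice between the two variants comes down to one question: is there reference audio to clone? A minimal sketch of that decision follows. The helper name `choose_model_folder` is hypothetical and not part of the project's scripts; only the folder names are taken from the descriptions above.

```python
from typing import Optional

# Folder names as listed above; the helper itself is illustrative only.
BASE_FOLDER = "Qwen3-TTS-12Hz-0.6B-Base/"
CUSTOM_VOICE_FOLDER = "Qwen3-TTS-12Hz-0.6B-CustomVoice/"


def choose_model_folder(reference_audio: Optional[str]) -> str:
    """Return the Base folder for cloning, otherwise the CustomVoice folder."""
    if reference_audio:  # a non-empty path means a specific voice should be cloned
        return BASE_FOLDER
    return CUSTOM_VOICE_FOLDER
```

Calling `choose_model_folder(None)` selects the CustomVoice folder, which matches the "no reference files to manage" use case.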

Architecture Details

Both models share the same underlying architecture:

  • Transformer-based encoder-decoder design
  • Text tokenizer optimized for multi-lingual TTS
  • Audio decoder producing discrete audio tokens at 12kHz
  • Post-processing upsamples to 24kHz for final WAV output

Key Technical Specs

Specification | Value
------------- | -----
Model Type | Encoder-Decoder Transformer
Text Vocabulary Size | ~150k tokens (BPE)
Audio Codec | Neural audio tokens (12kHz equiv.)
Upsampling | 12kHz → 24kHz linear interpolation
Context Length | Limited by chunking strategy (see pipeline)
Attention Mechanism | Supports SDPA, Eager (Flash Attention not required)
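
To make the upsampling row concrete, here is a minimal sketch of 2x linear interpolation on a list of samples. It only illustrates the technique named in the table; the function name `upsample_2x_linear` is hypothetical, and the project's own post-processing code may differ (for example, by using a resampling library).

```python
from typing import List, Sequence


def upsample_2x_linear(samples: Sequence[float]) -> List[float]:
    """Double the sample rate by inserting the midpoint between neighbours."""
    if not samples:
        return []
    out: List[float] = []
    for current, following in zip(samples, samples[1:]):
        out.append(current)                    # keep the original sample
        out.append((current + following) / 2)  # interpolated midpoint
    # The last sample has no right-hand neighbour, so it is repeated once
    # to keep the output exactly twice as long as the input.
    out.extend([samples[-1], samples[-1]])
    return out
```

One second of input at 12,000 samples therefore becomes 24,000 samples, which is the 12kHz → 24kHz step described above.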

Hugging Face References

Model Files

Each model folder contains:

  • config.json - Model configuration and architecture details
  • generation_config.json - Default generation parameters
  • preprocessor_config.json - Text preprocessing settings
  • tokenizer_config.json + vocab.json + merges.txt - BPE tokenizer
  • model.safetensors - Model weights in SafeTensors format
  • processor.py - Hugging Face processor wrapper
  • configuration.py - Model configuration class
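
A quick way to confirm that a download is complete is to check a model folder against the file list above. The sketch below assumes the files sit directly inside the folder; the helper name `missing_model_files` is hypothetical.

```python
from pathlib import Path
from typing import List

# File names taken from the list above.
EXPECTED_FILES = (
    "config.json",
    "generation_config.json",
    "preprocessor_config.json",
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
    "model.safetensors",
    "processor.py",
    "configuration.py",
)


def missing_model_files(model_dir: str) -> List[str]:
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty result means every listed file is present; otherwise the returned names show what still needs to be downloaded.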

Why These Models?

  1. Size Efficiency: 600M parameters fit comfortably in 6GB of VRAM, even with batching
  2. Quality: State-of-the-art naturalness for open-source TTS
  3. Flexibility: Supports both cloning and preset voice approaches
  4. Local First: All processing happens offline, ensuring privacy
  5. Community Support: Active Hugging Face community and documentation
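
The size claim can be checked with simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below covers weights only; activations, caches, and any audio codec components are not included, so real usage will be higher.

```python
# Bytes used per parameter for the data types mentioned in this document.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in ("bfloat16", "float32"):
    print(f"{dtype}: {weight_memory_gib(600_000_000, dtype):.2f} GiB")
```

At 600 million parameters this gives about 1.1 GiB in 16-bit precision and about 2.2 GiB in float32, which leaves headroom on a 6GB card.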

Loading Notes

The custom script loads models with a fallback chain of attention implementations:

  1. Attempts sdpa (Scaled Dot Product Attention) first - fastest on Windows
  2. Falls back to eager if SDPA fails or is unavailable
  3. Uses low_cpu_mem_usage=True to reduce RAM pressure during loading
  4. Supports multiple data types (bfloat16, float16, float32) for hardware compatibility
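
The fallback order above can be sketched as a small wrapper around any loader function. This is an illustration under stated assumptions, not the project's actual script: `load_with_attention_fallback` and its `loader` parameter are hypothetical names, and the set of exceptions caught is a guess at what a failed attention backend might raise.

```python
from typing import Any, Callable, List, Sequence, Tuple


def load_with_attention_fallback(
    loader: Callable[..., Any],
    model_dir: str,
    attn_order: Sequence[str] = ("sdpa", "eager"),
    **load_kwargs: Any,
) -> Tuple[Any, str]:
    """Try each attention implementation in order; return (model, name used)."""
    errors: List[str] = []
    for attn in attn_order:
        try:
            model = loader(
                model_dir,
                attn_implementation=attn,
                low_cpu_mem_usage=True,  # reduce RAM pressure while loading
                **load_kwargs,           # e.g. a dtype argument, passed through
            )
            return model, attn
        except (ImportError, ValueError, RuntimeError) as exc:
            # Assumed failure modes; remember the reason and try the next one.
            errors.append(f"{attn}: {exc}")
    raise RuntimeError("all attention implementations failed: " + "; ".join(errors))
```

With Hugging Face Transformers, `loader` could be a model class's `from_pretrained` method, which accepts the `attn_implementation` and `low_cpu_mem_usage` keyword arguments. Which class the project's script actually calls is not shown here.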

Voice Characteristics

CustomVoice Speakers

Each preset voice has distinct characteristics:

  • Serena: Female, warm, neutral accent
  • Vivian: Female, energetic, clear articulation
  • Uncle_Fu: Male, mature, calm tone
  • Ryan: Male, youthful, conversational
  • Aiden: Male, friendly, medium pace
  • Ono_Anna: Female, soft, gentle tone
  • Sohee: Female, crisp, professional
  • Eric: Male, authoritative, clear
  • Dylan: Male, relaxed, natural flow
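
For scripts that accept a speaker name from user input, it can help to validate the name against the preset list before starting inference. The sketch below stores the list above as a dictionary; the helper `resolve_speaker` is hypothetical, and whether the model itself treats speaker names as case-sensitive is not stated here.

```python
# Names and descriptions copied from the list above.
PRESET_SPEAKERS = {
    "Serena": "Female, warm, neutral accent",
    "Vivian": "Female, energetic, clear articulation",
    "Uncle_Fu": "Male, mature, calm tone",
    "Ryan": "Male, youthful, conversational",
    "Aiden": "Male, friendly, medium pace",
    "Ono_Anna": "Female, soft, gentle tone",
    "Sohee": "Female, crisp, professional",
    "Eric": "Male, authoritative, clear",
    "Dylan": "Male, relaxed, natural flow",
}


def resolve_speaker(name: str) -> str:
    """Return the canonical preset name, matching case-insensitively."""
    lookup = {key.lower(): key for key in PRESET_SPEAKERS}
    try:
        return lookup[name.strip().lower()]
    except KeyError:
        choices = ", ".join(sorted(PRESET_SPEAKERS))
        raise ValueError(f"unknown speaker {name!r}; choose from: {choices}") from None
```

For example, `resolve_speaker("uncle_fu")` returns `"Uncle_Fu"`, while an unknown name raises a `ValueError` that lists the valid choices.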

Cloning Quality Factors

Voice cloning quality depends on:

  1. Reference Audio Length: Minimum 3 seconds, optimal 10+ seconds
  2. Audio Clarity: Low background noise, clear speech
  3. Speaker Consistency: Same voice throughout reference
  4. Recording Quality: 16kHz+ sample rate recommended
  5. Text Coverage: The reference transcript should match the spoken audio exactly
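
The length and sample-rate factors can be checked automatically before cloning. The sketch below uses Python's standard `wave` module, so it assumes the reference is an uncompressed WAV file; the function name `check_reference_wav` is hypothetical, and the thresholds are taken from the list above. Noise, speaker consistency, and transcript accuracy still need a manual check.

```python
import wave
from typing import List

MIN_SECONDS = 3.0            # minimum reference length
RECOMMENDED_SECONDS = 10.0   # length suggested for best results
RECOMMENDED_RATE_HZ = 16_000


def check_reference_wav(path: str) -> List[str]:
    """Return human-readable warnings for a reference WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    warnings: List[str] = []
    if duration < MIN_SECONDS:
        warnings.append(f"too short: {duration:.1f}s, need at least {MIN_SECONDS:.0f}s")
    elif duration < RECOMMENDED_SECONDS:
        warnings.append(f"usable but short: {duration:.1f}s, {RECOMMENDED_SECONDS:.0f}s+ is better")
    if rate < RECOMMENDED_RATE_HZ:
        warnings.append(f"low sample rate: {rate} Hz, {RECOMMENDED_RATE_HZ} Hz+ recommended")
    return warnings
```

An empty list means the file passes both checks.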

Updates and Versions

Check the Hugging Face model cards for:

  • Latest version information
  • Known limitations
  • Community feedback and examples
  • Citation information for academic use

Integration with Other Tools

These models can be used with:

  • Hugging Face Transformers library directly
  • LangChain for LLM-TTS chains
  • Custom Python applications via the provided scripts
  • Web APIs (though this suite focuses on local/offline use)

For the most current information, always refer to the official model pages on Hugging Face. See 08-directory-structure for complete file organization.