CLI Reference
Common Arguments
Both scripts share several core arguments for specifying input/output and model location.
| Argument | Description | Default |
|---|---|---|
| input_md | Path to the input Markdown file | (Required) |
| --output | Path to save the output WAV audio file | output.wav (base) / output_custom.wav (custom) |
| --model_path | Path to the local model weights folder | ./Qwen3-TTS-12Hz-0.6B-Base (base) / ./Qwen3-TTS-12Hz-0.6B-CustomVoice (custom) |
Processing Parameters
| Argument | Description | Default |
|---|---|---|
| --chunk_size (base) / --max-chars (custom) | Maximum characters per chunk (see Chunk Size Behavior below) | 400 (base) / 150 (custom) |
| --batch_size | Number of chunks to process in parallel on the GPU | 4 (base) / 1 (custom) |
| --lang | Language for generation (affects tokenizer and prosody) | English |
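As a rough illustration of how the shared arguments and their base-script defaults fit together, here is a minimal argparse sketch. It is not taken from the scripts themselves: the use of argparse, the `build_parser` name, and the integer types are assumptions; only the argument names and default values come from the tables above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented base-script arguments."""
    parser = argparse.ArgumentParser(description="Convert a Markdown file to speech.")
    parser.add_argument("input_md", help="Path to the input Markdown file")
    parser.add_argument("--output", default="output.wav",
                        help="Path to save the output WAV audio file")
    parser.add_argument("--model_path", default="./Qwen3-TTS-12Hz-0.6B-Base",
                        help="Path to the local model weights folder")
    # Assumed to be integers; the documentation only lists the default values.
    parser.add_argument("--chunk_size", type=int, default=400,
                        help="Maximum characters per chunk")
    parser.add_argument("--batch_size", type=int, default=4,
                        help="Number of chunks to process in parallel")
    parser.add_argument("--lang", default="English",
                        help="Language for generation")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["document.md", "--batch_size", "2"])
print(args.output, args.chunk_size, args.batch_size, args.lang)
```

Any argument omitted on the command line falls back to its documented default, so only `--batch_size` changes in this example.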
Voice Cloning Script (md_to_audio_base.py)
Unique Arguments
| Argument | Description | Default |
|---|---|---|
| --ref_audio | Path to reference audio file for voice cloning | (Required for cloning) |
| --ref_text | Transcript of the reference audio (must match audio content) | (Required for cloning) |
Usage Example
python md_to_audio_base.py document.md --ref_audio "my_voice.wav" --ref_text "This is my voice." --output "cloned_audio.wav"
Preset Voices Script (md_to_audio_custom.py)
Unique Arguments
| Argument | Description | Default |
|---|---|---|
| --speaker | Speaker name (case-insensitive) | Aiden |
| --instruct | Style instruction for the model | Speak naturally. |
| --dtype | Model data type (bfloat16, float16, float32) | bfloat16 |
Available Speakers
- Serena
- Vivian
- Uncle_Fu
- Ryan
- Aiden
- Ono_Anna
- Sohee
- Eric
- Dylan
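Because --speaker is documented as case-insensitive, a lookup along the following lines is one plausible way to map user input onto the canonical names. This is a sketch only: `resolve_speaker` is a hypothetical helper, and the real script may normalize names differently.

```python
# Canonical speaker names, copied from the list above.
SPEAKERS = ["Serena", "Vivian", "Uncle_Fu", "Ryan", "Aiden",
            "Ono_Anna", "Sohee", "Eric", "Dylan"]


def resolve_speaker(name: str) -> str:
    """Return the canonical speaker name for a case-insensitive match."""
    lookup = {speaker.lower(): speaker for speaker in SPEAKERS}
    key = name.strip().lower()
    if key not in lookup:
        raise ValueError(
            f"Unknown speaker {name!r}; choose from: {', '.join(SPEAKERS)}"
        )
    return lookup[key]


print(resolve_speaker("uncle_fu"))
print(resolve_speaker("RYAN"))
```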
Usage Examples
# Basic usage with default settings
python md_to_audio_custom.py document.md
# Specify speaker and style
python md_to_audio_custom.py document.md --speaker "Ryan" --instruct "Excited tone"
# Use float16 instead of bfloat16 (if experiencing numerical issues)
python md_to_audio_custom.py document.md --dtype "float16"
# Increase batch size if you have more than 6GB VRAM
python md_to_audio_custom.py document.md --batch_size 2
Chunk Size Behavior
Base Script (md_to_audio_base.py)
- The --chunk_size argument sets the target chunk size; unless overridden, the script may reduce the actual chunk size to 200 characters.
- This adaptive behavior helps prevent VRAM overflow on long sentences.
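To make the idea of a per-chunk character limit concrete, the sketch below packs whole sentences into chunks that never exceed a given limit and hard-splits any single sentence that is too long. It is an illustration under assumptions, not the script's actual splitter: `split_into_chunks` is a hypothetical name, and the sentence-boundary regex is a simplification.

```python
import re


def split_into_chunks(text: str, max_chars: int) -> list[str]:
    """Greedily pack sentences into chunks of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is cut into fixed-size pieces.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


sample = "One two three. Four five six. Seven."
for chunk in split_into_chunks(sample, max_chars=20):
    print(len(chunk), chunk)
```

A smaller limit yields more, shorter chunks, which is the trade-off between VRAM headroom and per-chunk overhead that the chunk-size arguments control.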
Custom Script (md_to_audio_custom.py)
- The --max-chars argument is used as-is with no fallback reduction.
- The value of 150 was empirically determined to balance speed and quality on 6GB VRAM cards.
- Lower values increase overhead; higher values risk OOM errors.
- For maximum throughput on GPUs with >8GB VRAM, increase --batch_size in the base script.
- For latency-sensitive applications (like real-time preview), keep --batch_size at 1 and tune --max-chars.
- The custom script's --dtype parameter allows trading speed for numerical stability:
  - bfloat16: Fastest on Ampere (RTX 30xx) and newer
  - float16: Good balance, slightly slower than bfloat16
  - float32: Slowest but most numerically stable (not recommended unless necessary)
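The effect of --batch_size can be pictured as grouping the chunk list into fixed-size batches, each of which would be sent to the GPU in one pass. The helper below is a hypothetical sketch of that grouping, not code from either script.

```python
from typing import Iterator


def iter_batches(chunks: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most batch_size chunks."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(chunks), batch_size):
        yield chunks[start:start + batch_size]


chunks = ["chunk-1", "chunk-2", "chunk-3", "chunk-4", "chunk-5"]
for batch in iter_batches(chunks, batch_size=2):
    print(batch)
```

With a batch size of 1, every chunk is its own batch, which corresponds to the latency-oriented setting mentioned above.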
Error Handling
The two scripts handle batch failures differently:
- If a GPU batch fails, the base script falls back to sequential processing.
- The custom script stops on batch failure and reports the error (often caused by special characters or VRAM issues).
- Common errors and solutions are documented in troubleshooting.
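The base script's fallback from batch to sequential processing could look roughly like the pattern below. Everything here is an assumption made for illustration: `synthesize_with_fallback` and `fake_synthesize` are hypothetical names, and `RuntimeError` stands in for whatever exception the real code catches.

```python
from typing import Callable


def synthesize_with_fallback(
    batch: list[str],
    synthesize_batch: Callable[[list[str]], list[str]],
) -> list[str]:
    """Try the whole batch first; on failure, retry one chunk at a time."""
    try:
        return synthesize_batch(batch)
    except RuntimeError as exc:
        print(f"Batch of {len(batch)} failed ({exc}); retrying sequentially")
        results: list[str] = []
        for chunk in batch:
            # Each chunk is sent as its own single-item batch.
            results.extend(synthesize_batch([chunk]))
        return results


def fake_synthesize(batch: list[str]) -> list[str]:
    """Stand-in for a TTS call that fails on any batch larger than one."""
    if len(batch) > 1:
        raise RuntimeError("simulated out-of-memory error")
    return [f"audio({chunk})" for chunk in batch]


print(synthesize_with_fallback(["Hello.", "World."], fake_synthesize))
```

An error raised during the sequential retry is not caught, so a chunk that fails on its own still surfaces to the caller.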