Markdown-to-Audio Pipeline
Overview
The conversion process follows a multi-stage pipeline designed to optimize both quality and performance on consumer GPUs like the RTX 3050.
Pipeline Stages
1. Input Processing
- File Reading: Load Markdown file with UTF-8 encoding
- Text Cleaning: Apply the clean_markdown() function to:
  - Convert [text](url) links to plain text
  - Remove markdown syntax (*_~#`)
  - Strip HTML tags
  - (Custom voice only) Remove non-ASCII characters (emojis, special chars)
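The cleaning steps listed above could look roughly like the sketch below. It is an illustration only: the project's actual clean_markdown() may use different patterns, and the `ascii_only` flag is a hypothetical parameter standing in for the custom-voice behaviour.

```python
import re


def clean_markdown(text: str, ascii_only: bool = False) -> str:
    """Illustrative cleaner; the real clean_markdown() may differ."""
    # [text](url) -> text
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Strip HTML tags such as <br> or <span class="x">
    text = re.sub(r"<[^>]+>", "", text)
    # Remove the markdown syntax characters * _ ~ # `
    text = re.sub(r"[*_~#`]", "", text)
    if ascii_only:
        # Custom-voice path (assumed flag): drop emojis and other non-ASCII characters
        text = text.encode("ascii", errors="ignore").decode("ascii")
    # Collapse runs of spaces or tabs left behind by the removals
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Links are handled before the generic syntax removal so the link text survives while the URL is discarded.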
2. Text Chunking
Long documents are split into manageable chunks to:
- Stay within model context limits
- Enable parallel processing
- Prevent VRAM overflow
Two chunking strategies exist:
- Base Script: Default 400 chars, falls back to 200 if not overridden
- Custom Script: Optimized 150 chars with comma-aware splitting
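One way to picture comma-aware splitting is a greedy chunker that prefers sentence boundaries, falls back to commas for oversized sentences, and hard-cuts only as a last resort. The sketch below is an assumption about how such a splitter might work, not the scripts' actual code; `split_text` is a hypothetical name.

```python
import re


def split_text(text: str, max_chars: int = 150) -> list[str]:
    """Greedy chunker: sentences first, then commas, then a hard cut."""
    # Break after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pieces: list[str] = []
    for sentence in sentences:
        if len(sentence) <= max_chars:
            pieces.append(sentence)
            continue
        # Oversized sentence: fall back to splitting after commas
        for part in re.split(r"(?<=,)\s+", sentence):
            # Last resort: hard-cut anything still over the limit
            pieces.extend(part[i:i + max_chars] for i in range(0, len(part), max_chars))

    # Pack consecutive pieces into chunks without exceeding max_chars
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if not piece:
            continue
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Splitting at commas rather than at arbitrary character offsets keeps each chunk ending on a natural pause, which tends to matter for speech output.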
3. Voice Prompt Preparation (Optional)
For voice cloning (md_to_audio_base.py):
- Process reference audio and text
- Generate voice clone prompt using model.create_voice_clone_prompt()
- Cache the prompt for reuse across all chunks
4. Adaptive Token Budgeting (Custom Script Only)
Before each GPU call, calculate optimal max_new_tokens:
    def estimate_max_tokens(chunk: str, chars_per_token: float = 4.5, audio_tokens_per_text_token: float = 12.0, safety_margin: float = 1.3) -> int:
        # Rough text-token count from the chunk's character length
        approx_text_tokens = len(chunk) / chars_per_token
        # Scale to audio tokens and add headroom
        approx_audio_tokens = approx_text_tokens * audio_tokens_per_text_token * safety_margin
        # Clamp the budget to the range [256, 1024]
        return int(min(max(approx_audio_tokens, 256), 1024))

This prevents wasting computation on short chunks by sizing the generation budget to each chunk, within a floor of 256 and a ceiling of 1024 tokens.
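To see how the clamp behaves, the function can be evaluated for a few chunk lengths. The block repeats the definition so it runs on its own; the sample lengths (45, 180, 450 characters) are arbitrary choices for illustration.

```python
def estimate_max_tokens(
    chunk: str,
    chars_per_token: float = 4.5,
    audio_tokens_per_text_token: float = 12.0,
    safety_margin: float = 1.3,
) -> int:
    approx_text_tokens = len(chunk) / chars_per_token
    approx_audio_tokens = approx_text_tokens * audio_tokens_per_text_token * safety_margin
    return int(min(max(approx_audio_tokens, 256), 1024))


# A short chunk hits the floor, a mid-sized one scales, a long one hits the cap
for n_chars in (45, 180, 450):
    print(n_chars, estimate_max_tokens("x" * n_chars))
```

With the default parameters, the floor of 256 applies to chunks shorter than roughly 74 characters and the cap of 1024 to chunks longer than roughly 295 characters, so the budget only varies between those lengths.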
5. GPU Inference
Process batches of chunks on the GPU:
- Base Script: Uses fixed --batch_size (default 4)
- Custom Script: Designed for --batch-size 1 on 6GB VRAM cards
- Utilizes SDPA (Scaled Dot Product Attention) for optimized attention computation
- Applies TF32 optimizations for Ampere architecture (RTX 30xx):

    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
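Grouping chunks into batches amounts to slicing the chunk list in steps of the batch size. The helper below is a minimal sketch of that idea; `iter_batches` is a hypothetical name and not a function taken from either script.

```python
from typing import Iterator


def iter_batches(chunks: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most batch_size chunks."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(chunks), batch_size):
        yield chunks[start:start + batch_size]


# With batch_size=1 (the 6GB VRAM setting) every chunk is its own batch
for batch in iter_batches(["first chunk", "second chunk", "third chunk"], batch_size=1):
    print(batch)
```

The last batch may be smaller than the others when the chunk count is not a multiple of the batch size.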
6. Audio Assembly
- Collect generated audio waveforms from all chunks
- Concatenate waveforms using NumPy
- Write final audio file as WAV at 24kHz sample rate
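The assembly step can be sketched with NumPy and the standard-library `wave` module. This assumes mono floating-point waveforms in the range [-1, 1] and 16-bit PCM output; the scripts may use a different audio library or sample format, and `assemble_wav` is a hypothetical name.

```python
import io
import wave

import numpy as np

SAMPLE_RATE = 24_000  # 24 kHz, as stated above


def assemble_wav(waveforms, destination, sample_rate: int = SAMPLE_RATE) -> int:
    """Concatenate mono float waveforms and write them as 16-bit PCM WAV.

    destination may be a file path string or a writable binary file object.
    Returns the number of samples written.
    """
    audio = np.concatenate([np.asarray(w, dtype=np.float32).reshape(-1) for w in waveforms])
    # Clip to [-1, 1] before scaling so out-of-range samples do not wrap around
    pcm = (np.clip(audio, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(destination, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono (assumption)
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)


# Example: 0.1 s of silence followed by 0.05 s of a constant signal
buffer = io.BytesIO()
written = assemble_wav([np.zeros(2400), np.full(1200, 0.5)], buffer)
print(written / SAMPLE_RATE, "seconds")
```

Note that `np.concatenate` raises a `ValueError` on an empty list, so a caller would need to handle the case where no chunk produced audio.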
Model-Specific Differences
Voice Cloning (md_to_audio_base.py)
- Model: Qwen3-TTS-12Hz-0.6B-Base
- Features: Voice cloning from reference audio
- Batch Size: Optimized for higher batch sizes (4+)
- Chunk Size: Larger default (400 chars, adaptive to 200)
- Voice Control: Reference audio determines speaker characteristics
Preset Voices (md_to_audio_custom.py)
- Model: Qwen3-TTS-12Hz-0.6B-CustomVoice
- Features: Built-in voices with style instructions
- Voice Selection: Choose from Serena, Vivian, Uncle_Fu, Ryan, Aiden, Ono_Anna, Sohee, Eric, Dylan
- Style Control: --instruct parameter (e.g., “Speak naturally.”, “Excited tone”)
- Optimizations: Aggressive settings for 6GB VRAM cards:
  - Batch size = 1
  - Max chars = 150
  - bfloat16 dtype preferred
  - Per-chunk adaptive token budgeting
Performance Characteristics
- Latency: ~15-30 seconds per chunk on RTX 3050 (varies with chunk size and batch settings)
- Throughput: Improved with batching, but limited by VRAM
- Quality: 24kHz output, natural-sounding speech
- Scalability: Linear with chunk count; parallelizable across chunks
Integration Points
The pipeline is designed for easy integration:
- Input: Plain text or Markdown file path
- Parameters: Model path, voice/speaker selection, language, batch settings
- Output: WAV file path with generated audio
- Error handling: Graceful degradation with sequential fallback on batch failures
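The sequential fallback mentioned above can be pictured as: try the whole batch once, and if that fails, process its chunks one at a time. The sketch below is an assumption about the shape of that logic. `generate_with_fallback` and the `generate` callable are hypothetical stand-ins for the model call, and catching `RuntimeError` is a guess at which errors the scripts handle (PyTorch's CUDA out-of-memory error is a `RuntimeError` subclass).

```python
from typing import Callable, Sequence


def generate_with_fallback(
    batch: Sequence[str],
    generate: Callable[[list[str]], list],
) -> list:
    """Try the whole batch once; on failure, retry each chunk on its own."""
    try:
        return generate(list(batch))
    except RuntimeError:
        # Batch failed (for example, out of memory): fall back to one chunk per call
        results = []
        for chunk in batch:
            results.extend(generate([chunk]))
        return results


# Stand-in generator that only copes with single-item batches
def fake_generate(chunks: list[str]) -> list[str]:
    if len(chunks) > 1:
        raise RuntimeError("simulated out-of-memory")
    return [f"audio for {c!r}" for c in chunks]


print(generate_with_fallback(["chunk one", "chunk two"], fake_generate))
```

A failure on a single-chunk retry still propagates to the caller, so a chunk that cannot be generated even on its own is reported instead of being silently skipped.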
See 05-cli-reference for detailed parameter documentation.