Markdown-to-Audio Pipeline

Overview

The conversion process follows a multi-stage pipeline designed to optimize both quality and performance on consumer GPUs like the RTX 3050.

Pipeline Stages

1. Input Processing

File Reading: Load Markdown file with UTF-8 encoding
Text Cleaning: Apply clean_markdown() function to:
- Convert [text](url) links to plain text
- Remove Markdown syntax characters (*_~#`)
- Strip HTML tags
- (Custom voice only) Remove non-ASCII characters (emojis, special chars)
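
The cleaning steps above can be sketched as follows. This is a minimal illustration, not the project's actual clean_markdown() implementation: the regular expressions, the ascii_only parameter name, and the whitespace collapsing at the end are assumptions made for the example.

```python
import re


def clean_markdown(text: str, ascii_only: bool = False) -> str:
    """Illustrative re-implementation of the cleaning steps (assumed, not the original)."""
    # [text](url) -> text
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Strip HTML tags such as <b> or <br/>
    text = re.sub(r"<[^>]+>", "", text)
    # Remove Markdown syntax characters: * _ ~ # `
    text = re.sub(r"[*_~#`]", "", text)
    if ascii_only:
        # Custom-voice path only: drop emojis and other non-ASCII characters
        text = text.encode("ascii", errors="ignore").decode("ascii")
    # Collapse runs of spaces and tabs left behind by the removals (assumption)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


sample = "# Hello *world*, see [the docs](https://example.com) <b>now</b>"
cleaned = clean_markdown(sample)
print(cleaned)
```

Note that removing every underscore also alters identifiers such as snake_case names, which may or may not matter for spoken output.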

2. Text Chunking

Long documents are split into manageable chunks to:

Stay within model context limits
Enable parallel processing
Prevent VRAM overflow

Two chunking strategies exist:

Base Script: Default 400 chars, falls back to 200 if not overridden
Custom Script: Optimized 150 chars with comma-aware splitting
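
One plausible shape for comma-aware splitting is sketched below. The function name chunk_text and the exact strategy (sentence boundaries first, then commas, then hard cuts) are assumptions for illustration; the custom script may split differently.

```python
import re


def chunk_text(text: str, max_chars: int = 150) -> list[str]:
    """Greedy splitter: sentence boundaries first, then commas, then hard cuts."""
    chunks: list[str] = []
    current = ""

    def flush() -> None:
        nonlocal current
        if current:
            chunks.append(current)
            current = ""

    def add(piece: str) -> None:
        # Append the piece to the current chunk if it fits, else start a new chunk
        nonlocal current
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            flush()
            current = piece

    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        if len(sentence) <= max_chars:
            add(sentence)
            continue
        # Sentence is too long: fall back to splitting after commas
        for clause in re.split(r"(?<=,)\s+", sentence):
            # Hard-cut any clause that is still over the limit
            for i in range(0, len(clause), max_chars):
                add(clause[i:i + max_chars])
    flush()
    return chunks


demo = "one two three, four five six, seven eight nine."
print(chunk_text(demo, max_chars=20))
```

Splitting at commas keeps each chunk ending on a natural pause, which tends to sound better than cutting mid-phrase.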

3. Voice Prompt Preparation (Optional)

For voice cloning (md_to_audio_base.py):

Process reference audio and text
Generate voice clone prompt using model.create_voice_clone_prompt()
Cache the prompt for reuse across all chunks
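
The caching step amounts to building the prompt once, outside the per-chunk loop. The sketch below shows that pattern with stand-in callables: build_prompt and synthesize are hypothetical wrappers (build_prompt would wrap model.create_voice_clone_prompt(), whose exact arguments are not shown here), and the fake functions exist only so the example runs without a model.

```python
from typing import Any, Callable, Iterable


def synthesize_with_cached_prompt(
    chunks: Iterable[str],
    build_prompt: Callable[[], Any],
    synthesize: Callable[[str, Any], Any],
) -> list[Any]:
    """Build the voice prompt once, then reuse it for every chunk."""
    prompt = build_prompt()  # the reference audio is processed only here
    return [synthesize(chunk, prompt) for chunk in chunks]


# Stand-ins for the real model calls (hypothetical, for demonstration only)
calls = {"build": 0}


def fake_build_prompt() -> str:
    calls["build"] += 1
    return "voice-prompt"


def fake_synthesize(chunk: str, prompt: str) -> str:
    return f"{prompt}:{chunk}"


outputs = synthesize_with_cached_prompt(["a", "b", "c"], fake_build_prompt, fake_synthesize)
print(outputs, calls)
```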

4. Adaptive Token Budgeting (Custom Script Only)

Before each GPU call, calculate optimal max_new_tokens:

def estimate_max_tokens(chunk: str, chars_per_token: float = 4.5,
                        audio_tokens_per_text_token: float = 12.0,
                        safety_margin: float = 1.3) -> int:
    # Rough text-token count estimated from the character length
    approx_text_tokens = len(chunk) / chars_per_token
    approx_audio_tokens = approx_text_tokens * audio_tokens_per_text_token * safety_margin
    # Clamp the budget to the range [256, 1024]
    return int(min(max(approx_audio_tokens, 256), 1024))

This avoids wasting computation on short chunks: the generation budget scales with chunk length and is clamped to between 256 and 1024 tokens.
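
To see how the clamp behaves, the function can be evaluated for a few chunk lengths. The block repeats the definition so it runs on its own; the sample lengths are illustrative and not taken from the scripts.

```python
def estimate_max_tokens(chunk, chars_per_token=4.5,
                        audio_tokens_per_text_token=12.0,
                        safety_margin=1.3):
    approx_text_tokens = len(chunk) / chars_per_token
    approx_audio_tokens = approx_text_tokens * audio_tokens_per_text_token * safety_margin
    return int(min(max(approx_audio_tokens, 256), 1024))


budgets = {n: estimate_max_tokens("x" * n) for n in (20, 150, 400)}
for n_chars, budget in budgets.items():
    print(f"{n_chars:>3} chars -> max_new_tokens = {budget}")
# 20 chars hits the 256 floor, 150 chars gives 520, 400 chars hits the 1024 cap
```

With the default parameters each character contributes about 3.47 audio tokens (12.0 × 1.3 / 4.5), so chunks longer than roughly 295 characters reach the 1024 cap. The 150-character limit of the custom script stays well inside the clamped range.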

5. GPU Inference

Process batches of chunks on the GPU:

Base Script: Uses fixed --batch_size (default 4)
Custom Script: Designed for --batch-size 1 on 6GB VRAM cards
Utilizes SDPA (Scaled Dot Product Attention) for optimized attention computation

Applies TF32 optimizations for Ampere architecture (RTX 30xx):

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

6. Audio Assembly

Collect generated audio waveforms from all chunks
Concatenate waveforms using NumPy
Write final audio file as WAV at 24 kHz sample rate
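
The assembly step can be sketched as below. This is an illustration under stated assumptions: each waveform is taken to be a float array in the range [-1.0, 1.0], the output is written as 16-bit mono PCM with the standard-library wave module, and the function name write_wav is hypothetical. The actual scripts may use a different audio-writing library.

```python
import os
import tempfile
import wave

import numpy as np

SAMPLE_RATE = 24_000  # 24 kHz, as stated above


def write_wav(waveforms, path: str, sample_rate: int = SAMPLE_RATE) -> int:
    """Concatenate float waveforms and write 16-bit mono PCM; returns the sample count."""
    audio = np.concatenate(
        [np.asarray(w, dtype=np.float32).reshape(-1) for w in waveforms]
    )
    # Assumes float samples in [-1.0, 1.0]; clip before converting to int16
    pcm = (np.clip(audio, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return int(pcm.size)


# Two dummy "chunks": 0.1 s of silence followed by 0.05 s of constant signal
out_path = os.path.join(tempfile.mkdtemp(), "speech.wav")
n_samples = write_wav([np.zeros(2400), 0.5 * np.ones(1200)], out_path)
print(n_samples, "samples written to", out_path)
```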

Model-Specific Differences

Voice Cloning (`md_to_audio_base.py`)

Model: Qwen3-TTS-12Hz-0.6B-Base
Features: Voice cloning from reference audio
Batch Size: Optimized for higher batch sizes (4+)
Chunk Size: Larger default (400 chars, adaptive to 200)
Voice Control: Reference audio determines speaker characteristics

Preset Voices (`md_to_audio_custom.py`)

Model: Qwen3-TTS-12Hz-0.6B-CustomVoice
Features: Built-in voices with style instructions
Voice Selection: Choose from Serena, Vivian, Uncle_Fu, Ryan, Aiden, Ono_Anna, Sohee, Eric, Dylan
Style Control: --instruct parameter (e.g., “Speak naturally.”, “Excited tone”)
Optimizations: Aggressive settings for 6GB VRAM cards:
- Batch size = 1
- Max chars = 150
- bfloat16 dtype preferred
- Per-chunk adaptive token budgeting

Performance Characteristics

Latency: ~15-30 seconds per chunk on RTX 3050 (varies with chunk size and batch settings)
Throughput: Improved with batching, but limited by VRAM
Quality: 24 kHz output, natural-sounding speech
Scalability: Linear with chunk count; parallelizable across chunks

Integration Points

The pipeline is designed for easy integration:

Input: Plain text or Markdown file path
Parameters: Model path, voice/speaker selection, language, batch settings
Output: WAV file path with generated audio
Error handling: Graceful degradation with sequential fallback on batch failures
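
The sequential fallback can be pictured as follows. The function name run_with_fallback and the generate_batch callable are hypothetical stand-ins for the GPU call, and catching RuntimeError is an assumption (PyTorch reports CUDA out-of-memory conditions as RuntimeError subclasses); the scripts may catch a different exception type.

```python
from typing import Any, Callable, Sequence


def run_with_fallback(
    chunks: Sequence[str],
    batch_size: int,
    generate_batch: Callable[[list[str]], list[Any]],
) -> list[Any]:
    """Try each batch in one call; on failure, retry its chunks one at a time."""
    results: list[Any] = []
    for start in range(0, len(chunks), batch_size):
        batch = list(chunks[start:start + batch_size])
        try:
            results.extend(generate_batch(batch))
        except RuntimeError:
            # Batch failed (for example, out of memory): process sequentially
            for chunk in batch:
                results.extend(generate_batch([chunk]))
    return results


def fake_generate(batch: list[str]) -> list[str]:
    # Stand-in for GPU inference that cannot handle more than two chunks at once
    if len(batch) > 2:
        raise RuntimeError("simulated out of memory")
    return [chunk.upper() for chunk in batch]


audio = run_with_fallback(["a", "b", "c", "d"], batch_size=3, generate_batch=fake_generate)
print(audio)
```

Output order matches input order in both the batched and the fallback path, which matters because the results are concatenated afterwards.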

See 05-cli-reference for detailed parameter documentation.
