Markdown-to-Audio Pipeline

Overview

The conversion process is a multi-stage pipeline that balances audio quality against speed and VRAM use on consumer GPUs such as the RTX 3050.

Pipeline Stages

1. Input Processing

  • File Reading: Load Markdown file with UTF-8 encoding
  • Text Cleaning: Apply the clean_markdown() function to:
    • Convert [text](url) links to plain text
    • Remove Markdown syntax characters (*_~#`)
    • Strip HTML tags
    • (Custom voice only) Remove non-ASCII characters (emojis, special chars)
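The cleaning steps above can be sketched with a few regular expressions. This is a minimal illustration, not the project's actual clean_markdown(): the ascii_only parameter and the final whitespace collapse are assumptions added for the example.

```python
import re

def clean_markdown(text: str, ascii_only: bool = False) -> str:
    """Illustrative cleaner; the real clean_markdown() may differ."""
    # [text](url) -> text
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Strip HTML tags such as <br> or <span class="x">
    text = re.sub(r"<[^>]+>", "", text)
    # Remove the Markdown syntax characters listed above: * _ ~ # `
    text = re.sub(r"[*_~#`]", "", text)
    if ascii_only:
        # Hypothetical flag for the custom-voice path: drop emojis and
        # any other non-ASCII characters
        text = text.encode("ascii", errors="ignore").decode("ascii")
    # Assumption: collapse runs of spaces/tabs left behind by the removals
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

print(clean_markdown("# Hello *world*, see [docs](http://example.com) <b>now</b>"))
```

Links are handled before the character removal so that the bracket pattern still matches intact Markdown.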

2. Text Chunking

Long documents are split into manageable chunks to:

  • Stay within model context limits
  • Enable parallel processing
  • Prevent VRAM overflow

Two chunking strategies exist:

  • Base Script: Default 400 chars, falls back to 200 if not overridden
  • Custom Script: Optimized 150 chars with comma-aware splitting
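One way to read "comma-aware splitting" is a greedy splitter that prefers sentence boundaries, falls back to commas for long sentences, and hard-cuts only as a last resort. The sketch below works under that assumption; split_into_chunks is a hypothetical name, and the real scripts may split differently.

```python
import re

def split_into_chunks(text: str, max_chars: int = 150) -> list[str]:
    """Greedy splitter: sentences first, then commas, then a hard cut."""
    chunks: list[str] = []
    current = ""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        # Sentences longer than the limit are broken after commas.
        if len(sentence) <= max_chars:
            pieces = [sentence]
        else:
            pieces = re.split(r"(?<=,)\s+", sentence)
        for piece in pieces:
            # Last resort: hard-cut a piece that still exceeds the limit.
            while len(piece) > max_chars:
                if current:
                    chunks.append(current)
                    current = ""
                chunks.append(piece[:max_chars])
                piece = piece[max_chars:]
            if not piece:
                continue
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate  # still fits: keep packing
            else:
                chunks.append(current)  # flush and start a new chunk
                current = piece
    if current:
        chunks.append(current)
    return chunks

for chunk in split_into_chunks("First sentence. A longer one, with a comma, follows.", 30):
    print(repr(chunk))
```

Packing pieces greedily keeps chunks close to the limit, which reduces the number of GPU calls without exceeding the per-chunk character budget.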

3. Voice Prompt Preparation (Optional)

For voice cloning (md_to_audio_base.py):

  • Process reference audio and text
  • Generate voice clone prompt using model.create_voice_clone_prompt()
  • Cache the prompt for reuse across all chunks

4. Adaptive Token Budgeting (Custom Script Only)

Before each GPU call, calculate optimal max_new_tokens:

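def estimate_max_tokens(chunk: str, chars_per_token: float = 4.5, audio_tokens_per_text_token: float = 12.0, safety_margin: float = 1.3) -> int:
    # Estimate the text-token count from the chunk length in characters
    approx_text_tokens = len(chunk) / chars_per_token
    # Scale to audio tokens and add headroom so generation is not cut short
    approx_audio_tokens = approx_text_tokens * audio_tokens_per_text_token * safety_margin
    # Clamp the budget to the range [256, 1024]
    return int(min(max(approx_audio_tokens, 256), 1024))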

This scales the generation budget with chunk length, so short chunks are not given the full 1024-token maximum, while the 256-token floor leaves headroom for very short ones.
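To see how the clamp behaves, the sketch below evaluates the formula for three chunk lengths. The function is repeated in compact form only so the block runs on its own; the lengths 40, 150, and 400 are arbitrary examples. With the default parameters, the 256 floor applies below roughly 74 characters and the 1024 ceiling above roughly 295 characters.

```python
def estimate_max_tokens(chunk, chars_per_token=4.5,
                        audio_tokens_per_text_token=12.0, safety_margin=1.3):
    # Same formula as above, repeated so this block is self-contained.
    approx_audio_tokens = (len(chunk) / chars_per_token
                           * audio_tokens_per_text_token * safety_margin)
    return int(min(max(approx_audio_tokens, 256), 1024))

# Short chunk hits the floor, a 150-char chunk lands near 520,
# and a long chunk is capped at the ceiling.
for n_chars in (40, 150, 400):
    print(n_chars, estimate_max_tokens("x" * n_chars))
```

At the custom script's 150-character chunk size, the budget comes out at about half the ceiling, which is where the savings come from.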

5. GPU Inference

Process batches of chunks on the GPU:

  • Base Script: Uses fixed --batch_size (default 4)
  • Custom Script: Designed for --batch-size 1 on 6GB VRAM cards
  • Utilizes SDPA (Scaled Dot Product Attention) for optimized attention computation
  • Applies TF32 optimizations for Ampere architecture (RTX 30xx):
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

6. Audio Assembly

  • Collect generated audio waveforms from all chunks
  • Concatenate waveforms using NumPy
  • Write the final audio file as WAV at a 24 kHz sample rate
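The assembly step can be sketched with NumPy and the standard-library wave module. This is an illustration under assumptions: assemble_wav is a hypothetical helper, the waveforms are taken to be mono float arrays in the range [-1, 1], and the scripts themselves may use a different audio-writing library.

```python
import wave
import numpy as np

SAMPLE_RATE = 24_000  # 24 kHz, as stated above

def assemble_wav(waveforms, path, sample_rate=SAMPLE_RATE):
    """Concatenate float waveforms in [-1, 1] and write 16-bit mono PCM.

    Returns the duration of the written audio in seconds.
    """
    # Flatten each chunk to 1-D and join them in order.
    audio = np.concatenate(
        [np.asarray(w, dtype=np.float32).reshape(-1) for w in waveforms]
    )
    # Clip to the valid range, then convert to little-endian int16 samples.
    pcm = (np.clip(audio, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(audio) / sample_rate
```

For example, assemble_wav([chunk_a, chunk_b], "out.wav") writes both chunks back to back and returns the total duration, which is a cheap sanity check against the expected speech length.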

Model-Specific Differences

Voice Cloning (md_to_audio_base.py)

  • Model: Qwen3-TTS-12Hz-0.6B-Base
  • Features: Voice cloning from reference audio
  • Batch Size: Optimized for higher batch sizes (4+)
  • Chunk Size: Larger default (400 chars, adaptive to 200)
  • Voice Control: Reference audio determines speaker characteristics

Preset Voices (md_to_audio_custom.py)

  • Model: Qwen3-TTS-12Hz-0.6B-CustomVoice
  • Features: Built-in voices with style instructions
  • Voice Selection: Choose from Serena, Vivian, Uncle_Fu, Ryan, Aiden, Ono_Anna, Sohee, Eric, Dylan
  • Style Control: --instruct parameter (e.g., “Speak naturally.”, “Excited tone”)
  • Optimizations: Aggressive settings for 6GB VRAM cards:
    • Batch size = 1
    • Max chars = 150
    • bfloat16 dtype preferred
    • Per-chunk adaptive token budgeting

Performance Characteristics

  • Latency: ~15-30 seconds per chunk on RTX 3050 (varies with chunk size and batch settings)
  • Throughput: Improved with batching, but limited by VRAM
  • Quality: 24 kHz output, natural-sounding speech
  • Scalability: Linear with chunk count; parallelizable across chunks

Integration Points

The pipeline is designed for easy integration:

  • Input: Plain text or Markdown file path
  • Parameters: Model path, voice/speaker selection, language, batch settings
  • Output: WAV file path with generated audio
  • Error handling: Graceful degradation with sequential fallback on batch failures
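The sequential fallback mentioned above could look roughly like the following. This is a sketch, not the scripts' actual error handling: generate_with_fallback and the generate_batch callable are hypothetical names, and catching RuntimeError is an assumption based on how GPU out-of-memory failures usually surface in PyTorch.

```python
from typing import Callable, List, Sequence

def generate_with_fallback(
    chunks: Sequence[str],
    generate_batch: Callable[[List[str]], List[object]],
    batch_size: int = 4,
) -> List[object]:
    """Run chunks in batches; retry a failed batch one chunk at a time."""
    results: List[object] = []
    for start in range(0, len(chunks), batch_size):
        batch = list(chunks[start:start + batch_size])
        try:
            results.extend(generate_batch(batch))
        except RuntimeError:
            # Assumed failure mode (e.g. out of memory): process the same
            # chunks sequentially so output order is preserved.
            for chunk in batch:
                results.extend(generate_batch([chunk]))
    return results

# Usage with a stand-in generator that rejects batches larger than one.
def fake_generate(batch: List[str]) -> List[object]:
    if len(batch) > 1:
        raise RuntimeError("simulated batch failure")
    return [text.upper() for text in batch]

print(generate_with_fallback(["a", "b", "c"], fake_generate, batch_size=2))
```

If a single chunk also fails, the exception propagates to the caller, which keeps genuine errors visible instead of silently dropping audio.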

See 05-cli-reference for detailed parameter documentation.