Data Pipeline: rag_ingest.py
📖 Overview
Think of rag_ingest.py as the Prep Kitchen for your AI. Before you can ask questions, the AI needs to “read” and “memorize” your documents. This script scans your files, cuts them into manageable pieces, and translates them into a mathematical language (vectors) that the search engine can understand.
🔄 The Ingestion Pipeline
1. 📂 Step 1: Scanning (The Eye)
The system looks into your docs/ folder. It supports:
- PDFs: The most common format. We use `pypdf` to rip the text out.
- Markdown/Text: Read directly as raw text.
2. ✂️ Step 2: Chunking (The Knife)
We split the text into Parents (2000 chars) and Children (250 chars).
- Why? Because models have a “Context Window” (a limit on how much they can read at once). By chunking, we ensure we only send the most relevant parts of your documents to the AI, saving memory and time.
3. 🔢 Step 3: Embedding (The Translator)
This is where the magic happens. We use an Embedding Model (like Qwen3-Embedding) to turn each Child chunk into a list of 1024 numbers.
- Batching: We process 512 chunks at once. This is like cooking 10 pizzas in one oven instead of one by one—it’s much faster!
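The batching loop itself is simple; here is a sketch. `embed_batch` stands in for the real embedding call (e.g. a Qwen3-Embedding model); in this sketch it is just a function argument, and the default batch size of 512 comes from the docs above.

```python
def embed_in_batches(chunks, embed_batch, batch_size=512):
    """Embed chunks in groups of batch_size, collecting all vectors.

    embed_batch: callable taking a list of strings and returning a
    list of vectors (one per string). Sketch only, not the real API.
    """
    vectors = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(embed_batch(chunks[i:i + batch_size]))
    return vectors
```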
4. 💾 Step 4: Storage (The Freezer)
The results are saved into three files in the data/ folder:
- `vector_index.faiss`: The searchable math index.
- `doc_store.pkl`: The actual text of the Parents.
- `child_nodes.pkl`: The map linking Children to Parents.
⚙️ Configuration
| Parameter | Default | Why it matters |
|---|---|---|
| `n_gpu_layers` | -1 | -1 offloads all layers to your GPU for up to 10x faster processing. 0 uses only your CPU. |
| `n_batch` | 512 | Higher numbers speed up embedding but use more VRAM (GPU memory). |
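As a hypothetical example, here is how these two settings are typically passed when loading a GGUF embedding model with llama-cpp-python. The model path is illustrative, and this assumes the script uses llama-cpp-python; check `rag_ingest.py` for the actual loader.

```python
from llama_cpp import Llama

embedder = Llama(
    model_path="models/qwen3-embedding.gguf",  # illustrative path
    n_gpu_layers=-1,  # -1 = offload all layers to the GPU
    n_batch=512,      # larger batches = faster, but more VRAM
    embedding=True,   # run the model in embedding mode
)
```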
Usage Examples
```bash
python rag_ingest.py
```

Related Components
Last Updated: 2026-05-01