Data Pipeline: rag_ingest.py
📖 Overview
Think of rag_ingest.py as the Prep Kitchen for your AI. Before you can ask questions, the AI needs to “read” and “memorize” your documents. This script scans your files, cuts them into manageable pieces, and translates them into a mathematical language (vectors) that the search engine can understand.
🔄 The Ingestion Pipeline
1. 📂 Step 1: Scanning (The Eye)
The system looks into your docs/ folder. It supports:
- PDFs: The most common format. We use `pypdf` to rip the text out.
- Markdown/Text: Read directly as raw text.
2. ✂️ Step 2: Chunking (The Knife)
We split the text into Parents (2000 chars) and Children (250 chars).
- Why? Because models have a “Context Window” (a limit on how much they can read at once). By chunking, we ensure we only send the most relevant parts of your documents to the AI, saving memory and time.
3. 🔢 Step 3: Embedding (The Translator)
This is where the magic happens. We use an Embedding Model (like Qwen3-Embedding) to turn each Child chunk into a list of 1024 numbers.
- Batching: We process 512 chunks at once. This is like cooking 10 pizzas in one oven instead of one by one—it’s much faster!
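The batching loop itself is simple; here is a sketch. `embed_batch` stands in for the real embedding call (e.g. a Qwen3-Embedding model); in this sketch it is just a function argument, and the default batch size of 512 comes from the docs above.

```python
def embed_in_batches(chunks, embed_batch, batch_size=512):
    """Embed chunks in groups of batch_size, collecting all vectors.

    embed_batch: callable taking a list of strings and returning a
    list of vectors (one per string). Sketch only, not the real API.
    """
    vectors = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(embed_batch(chunks[i:i + batch_size]))
    return vectors
```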
4. 💾 Step 4: Storage (The Freezer)
The results are saved into three files in the data/ folder:
- `vector_index.faiss`: The searchable math index.
- `doc_store.pkl`: The actual text of the Parents.
- `child_nodes.pkl`: The map linking Children to Parents.
⚙️ Configuration
| Parameter | Default | Why it matters |
|---|---|---|
| `n_gpu_layers` | -1 | -1 offloads all layers to your GPU for up to 10x faster processing. 0 uses only your CPU. |
| `n_batch` | 512 | Higher numbers speed up embedding but use more VRAM (GPU memory). |
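As a hypothetical example, here is how these two settings are typically passed when loading a GGUF embedding model with llama-cpp-python. The model path is illustrative, and this assumes the script uses llama-cpp-python; check `rag_ingest.py` for the actual loader.

```python
from llama_cpp import Llama

embedder = Llama(
    model_path="models/qwen3-embedding.gguf",  # illustrative path
    n_gpu_layers=-1,  # -1 = offload all layers to the GPU
    n_batch=512,      # larger batches = faster, but more VRAM
    embedding=True,   # run the model in embedding mode
)
```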
Usage Examples
```bash
python rag_ingest.py
```

Related Components
Last Updated: 2026-05-01