Chat Engine: rag_inference.py

🗣️ Overview

This is the interface you actually talk to. rag_inference.py takes your question, searches the database for relevant facts, and then “augments” the AI’s memory with those facts so it can give you a grounded answer.


⚡ The RAG Loop (Retrieval-Augmented Generation)

1. Retrieval (Finding the Facts)

When you ask a question, the system converts your text into a vector and searches the FAISS Index. It finds the Top 4 most relevant Child chunks and retrieves their Parents.
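The retrieval step can be sketched as below. This is a minimal illustration, not the script's actual code: the toy vectors, the `child_to_parent` mapping, and the `retrieve` function are all hypothetical, and a brute-force NumPy inner-product search stands in for the FAISS index (which does the same computation for a flat index).

```python
import numpy as np

# Hypothetical toy data: one embedding per child chunk, plus a
# child -> parent mapping (several children share one parent).
child_vectors = np.array(
    [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.5, 0.5]],
    dtype="float32",
)
child_to_parent = {0: "Parent A", 1: "Parent A", 2: "Parent B",
                   3: "Parent B", 4: "Parent C"}

def retrieve(question_vector, k=4):
    # Inner-product search over child vectors (what a flat FAISS
    # index computes internally), keeping the k best children.
    scores = child_vectors @ question_vector
    top_children = np.argsort(-scores)[:k]
    # Swap each child for its parent, de-duplicating in rank order.
    seen, parents = set(), []
    for child in top_children:
        parent = child_to_parent[int(child)]
        if parent not in seen:
            seen.add(parent)
            parents.append(parent)
    return parents
```

For example, a question vector pointing along the first axis pulls in the children of Parent A first, then the others in score order.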

2. Augmentation (Giving the AI the Facts)

The system then builds a specialized prompt for the AI. It looks like this:

“You are a helpful assistant. Use the following pieces of context to answer the question.

[Context 1: Full text of Parent 1] [Context 2: Full text of Parent 2] …

User Question: How do I reset my password?”

By pasting the retrieved facts directly into the prompt, we ground the AI’s answer in your documents instead of letting it guess from its training data.
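Assembling that prompt is plain string formatting. A minimal sketch (the `build_prompt` helper is hypothetical, but it reproduces the template shown above):

```python
def build_prompt(contexts, question):
    # Number each retrieved parent so the model can tell them apart.
    context_block = "\n\n".join(
        f"[Context {i}: {text}]" for i, text in enumerate(contexts, start=1)
    )
    return (
        "You are a helpful assistant. Use the following pieces of context "
        "to answer the question.\n\n"
        f"{context_block}\n\n"
        f"User Question: {question}"
    )
```

The question goes last so the model reads all the facts before it starts answering.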

3. Generation (The Answer)

The LLM (Llama-3/Mistral) reads the retrieved facts and generates the answer token by token.
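Consuming that token-by-token output looks like the loop below. The `stream_answer` function and the demo token list are illustrative, standing in for whatever iterator the LLM backend actually returns:

```python
def stream_answer(token_stream):
    """Consume a token iterator, echoing each token as it arrives."""
    pieces = []
    for token in token_stream:
        print(token, end="", flush=True)  # appears immediately, no waiting
        pieces.append(token)
    print()
    return "".join(pieces)

# Stand-in for the LLM's streaming output:
demo_stream = iter(["To ", "reset ", "your ", "password, ", "open ", "Settings."])
```

Calling `stream_answer(demo_stream)` prints the sentence piece by piece and returns the full answer string.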


🧠 Smart Memory Management (The Swap)

Running two AI models (Embedding + Chat) at the same time can exhaust the VRAM of a standard GPU. RAGv2 uses a “Swap” technique:

  1. Load the Embedding model → create the question vector.
  2. Unload the Embedding model (frees up VRAM).
  3. Load the Chat model → generate the answer.

This allows RAGv2 to run on older GPUs with as little as 6GB or 8GB of VRAM!
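The swap above can be sketched as follows. Everything here is a simulation of the pattern, not RAGv2's actual code: `loaded`, `load_model`, and `answer` are hypothetical names, and hashing/string formatting stand in for real embedding and generation calls.

```python
import gc

loaded = []  # names of models currently resident in (simulated) VRAM

def load_model(name):
    # Stand-in for loading real model weights onto the GPU.
    loaded.append(name)
    return {"name": name}

def answer(question):
    # 1. Load the embedding model and vectorize the question.
    embedder = load_model("embedding")
    question_vector = hash(question)  # stand-in for embedder.encode(question)

    # 2. Drop every reference and force garbage collection to free VRAM.
    del embedder
    loaded.remove("embedding")
    gc.collect()  # with PyTorch you would also call torch.cuda.empty_cache()
    assert "embedding" not in loaded  # only one model resident at a time

    # 3. Load the chat model and generate the answer.
    llm = load_model("chat")
    return f"grounded answer to {question!r}"  # stand-in for llm.generate(...)
```

The key detail is step 2: every Python reference to the embedding model must be dropped before collection, or the VRAM is never actually released.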


🎛️ Personality Settings

  • Temperature (0.1): We keep this low so the AI stays focused on the facts (precision) rather than inventing them (hallucination).
  • Stream (True): This prints the answer as it is generated, token by token, so you don’t have to wait for the whole block to finish.
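Expressed as a settings object, those two knobs might look like this (the dict name and any extra keys are illustrative; use whatever parameter names your LLM backend expects):

```python
generation_settings = {
    "temperature": 0.1,  # near-deterministic: stick to the retrieved facts
    "stream": True,      # yield tokens as they are produced
}
```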

Usage Examples

python rag_inference.py

Last Updated: 2026-05-01