Chat Engine: rag_inference.py

🗣️ Overview

This is the interface you actually talk to. rag_inference.py takes your question, searches the database for relevant facts, and then “augments” the AI’s memory with those facts so it can give you a grounded answer.


⚡ The RAG Loop (Retrieval-Augmented Generation)

1. Retrieval (Finding the Facts)

When you ask a question, the system converts your text into a vector and searches the FAISS Index. It finds the Top 4 most relevant Child chunks and retrieves their Parents.
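The retrieval step can be sketched as below. This is a minimal illustration, not the script's actual code: the toy vectors, the `child_to_parent` mapping, and the `retrieve` function are all hypothetical, and a brute-force NumPy inner-product search stands in for the FAISS index (which does the same computation for a flat index).

```python
import numpy as np

# Hypothetical toy data: one embedding per child chunk, plus a
# child -> parent mapping (several children share one parent).
child_vectors = np.array(
    [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.5, 0.5]],
    dtype="float32",
)
child_to_parent = {0: "Parent A", 1: "Parent A", 2: "Parent B",
                   3: "Parent B", 4: "Parent C"}

def retrieve(question_vector, k=4):
    # Inner-product search over child vectors (what a flat FAISS
    # index computes internally), keeping the k best children.
    scores = child_vectors @ question_vector
    top_children = np.argsort(-scores)[:k]
    # Swap each child for its parent, de-duplicating in rank order.
    seen, parents = set(), []
    for child in top_children:
        parent = child_to_parent[int(child)]
        if parent not in seen:
            seen.add(parent)
            parents.append(parent)
    return parents
```

For example, a question vector pointing along the first axis pulls in the children of Parent A first, then the others in score order.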

2. Augmentation (Giving the AI the Facts)

The system then builds a specialized prompt for the AI. It looks like this:

“You are a helpful assistant. Use the following pieces of context to answer the question.

[Context 1: Full text of Parent 1] [Context 2: Full text of Parent 2] …

User Question: How do I reset my password?”

By pasting the retrieved facts directly into the prompt, we ground the AI’s answer in your documents instead of letting it guess from its training data.
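Assembling that prompt is plain string formatting. A minimal sketch (the `build_prompt` helper is hypothetical, but it reproduces the template shown above):

```python
def build_prompt(contexts, question):
    # Number each retrieved parent so the model can tell them apart.
    context_block = "\n\n".join(
        f"[Context {i}: {text}]" for i, text in enumerate(contexts, start=1)
    )
    return (
        "You are a helpful assistant. Use the following pieces of context "
        "to answer the question.\n\n"
        f"{context_block}\n\n"
        f"User Question: {question}"
    )
```

The question goes last so the model reads all the facts before it starts answering.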

3. Generation (The Answer)

The LLM (Llama-3/Mistral) reads the retrieved facts and generates the answer token by token.
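Consuming that token-by-token output looks like the loop below. The `stream_answer` function and the demo token list are illustrative, standing in for whatever iterator the LLM backend actually returns:

```python
def stream_answer(token_stream):
    """Consume a token iterator, echoing each token as it arrives."""
    pieces = []
    for token in token_stream:
        print(token, end="", flush=True)  # appears immediately, no waiting
        pieces.append(token)
    print()
    return "".join(pieces)

# Stand-in for the LLM's streaming output:
demo_stream = iter(["To ", "reset ", "your ", "password, ", "open ", "Settings."])
```

Calling `stream_answer(demo_stream)` prints the sentence piece by piece and returns the full answer string.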


🧠 Smart Memory Management (The Swap)

Running two AI models (Embedding + Chat) at the same time can exhaust the VRAM of a standard GPU. RAGv2 uses a “Swap” technique:

  1. Load the Embedding model → create the question vector.
  2. Unload the Embedding model (frees up VRAM).
  3. Load the Chat model → generate the answer.

This allows RAGv2 to run on older GPUs with as little as 6GB or 8GB of VRAM!
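The swap above can be sketched as follows. Everything here is a simulation of the pattern, not RAGv2's actual code: `loaded`, `load_model`, and `answer` are hypothetical names, and hashing/string formatting stand in for real embedding and generation calls.

```python
import gc

loaded = []  # names of models currently resident in (simulated) VRAM

def load_model(name):
    # Stand-in for loading real model weights onto the GPU.
    loaded.append(name)
    return {"name": name}

def answer(question):
    # 1. Load the embedding model and vectorize the question.
    embedder = load_model("embedding")
    question_vector = hash(question)  # stand-in for embedder.encode(question)

    # 2. Drop every reference and force garbage collection to free VRAM.
    del embedder
    loaded.remove("embedding")
    gc.collect()  # with PyTorch you would also call torch.cuda.empty_cache()
    assert "embedding" not in loaded  # only one model resident at a time

    # 3. Load the chat model and generate the answer.
    llm = load_model("chat")
    return f"grounded answer to {question!r}"  # stand-in for llm.generate(...)
```

The key detail is step 2: every Python reference to the embedding model must be dropped before collection, or the VRAM is never actually released.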


🎛️ Personality Settings

  • Temperature (0.1): We keep this low so the AI stays focused on the facts (precision) rather than inventing them (hallucination).
  • Stream (True): This prints the answer as it is generated, token by token, so you don’t have to wait for the whole block to finish.
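Expressed as a settings object, those two knobs might look like this (the dict name and any extra keys are illustrative; use whatever parameter names your LLM backend expects):

```python
generation_settings = {
    "temperature": 0.1,  # near-deterministic: stick to the retrieved facts
    "stream": True,      # yield tokens as they are produced
}
```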

Usage Examples

python rag_inference.py

Last Updated: 2026-05-01