Chat Engine: rag_inference.py
🗣️ Overview
This is the interface you actually talk to. rag_inference.py takes your question, searches the database for relevant facts, and then “augments” the AI’s memory with those facts so it can give you a grounded answer.
⚡ The RAG Loop (Retrieval → Augmentation → Generation)
1. Retrieval (Finding the Facts)
When you ask a question, the system converts your text into a vector and searches the FAISS index. It finds the top 4 most similar child chunks, then retrieves the full parent chunks they belong to, so the AI sees the surrounding context and not just the matching snippet.
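A minimal sketch of this step, assuming a sentence-transformers embedding model and a pickled child-to-parent lookup table. The paths, model name, and helper name below are illustrative, not the script's actual identifiers:

```python
import pickle

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical artifact paths; adjust to wherever the ingestion step wrote them.
INDEX_PATH = "storage/children.faiss"
PARENTS_PATH = "storage/child_to_parent.pkl"

def retrieve_parents(question: str, k: int = 4) -> list[str]:
    """Embed the question, find the top-k child chunks, return their parents."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    query_vec = embedder.encode([question]).astype(np.float32)

    index = faiss.read_index(INDEX_PATH)
    _, child_ids = index.search(query_vec, k)  # top-k nearest child chunks

    with open(PARENTS_PATH, "rb") as f:
        child_to_parent = pickle.load(f)  # child id -> full parent text

    # Deduplicate: several children can share the same parent.
    seen, parents = set(), []
    for cid in child_ids[0]:
        text = child_to_parent[int(cid)]
        if text not in seen:
            seen.add(text)
            parents.append(text)
    return parents
```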
2. Augmentation (Giving the AI the Facts)
The system then builds a specialized prompt for the AI. It looks like this:
“You are a helpful assistant. Use the following pieces of context to answer the question.
[Context 1: Full text of Parent 1] [Context 2: Full text of Parent 2] …
User Question: How do I reset my password?”
By “pasting” the facts directly into the prompt, we ground the AI's answer in the retrieved text instead of leaving it to guess from its training data.
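Assembling that prompt is plain string formatting. A sketch (build_prompt is an illustrative name, and the system message paraphrases the template above):

```python
def build_prompt(question: str, parents: list[str]) -> str:
    """Paste the retrieved parent chunks into the prompt ahead of the question."""
    context = "\n\n".join(
        f"[Context {i + 1}: {text}]" for i, text in enumerate(parents)
    )
    return (
        "You are a helpful assistant. Use the following pieces of context "
        "to answer the question.\n\n"
        f"{context}\n\n"
        f"User Question: {question}"
    )
```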
3. Generation (The Answer)
The LLM (Llama-3 or Mistral) reads those facts and writes out the answer token by token.
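Continuing the sketches above (and reusing the hypothetical retrieve_parents and build_prompt helpers), streaming generation with llama-cpp-python looks roughly like this; the GGUF path is an assumption:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",  # hypothetical model file
    n_ctx=4096,       # leave room for the pasted parent chunks
    n_gpu_layers=-1,  # offload every layer to the GPU
)

question = "How do I reset my password?"
prompt = build_prompt(question, retrieve_parents(question))

for chunk in llm(prompt, max_tokens=512, stream=True):
    # Each chunk carries the next slice of text; print it immediately.
    print(chunk["choices"][0]["text"], end="", flush=True)
print()
```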
🧠 Smart Memory Management (The Swap)
Holding two AI models (embedding + chat) in memory at the same time can exhaust the VRAM of a standard computer. RAGv2 uses a “Swap” technique:
- Load Embedding model → Create question vector.
- Unload Embedding model (frees up VRAM).
- Load Chat model → Generate answer.
This allows RAGv2 to run on older GPUs with as little as 6–8 GB of VRAM!
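A sketch of the swap, assuming sentence-transformers for embeddings and llama-cpp-python for chat; torch is used here only to release cached VRAM:

```python
import gc

import torch
from llama_cpp import Llama
from sentence_transformers import SentenceTransformer

question = "How do I reset my password?"

# 1. Load the embedding model and vectorise the question.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
query_vec = embedder.encode([question])  # would feed the FAISS search from step 1

# 2. Unload it: drop the reference, collect garbage, release cached VRAM.
del embedder
gc.collect()
torch.cuda.empty_cache()

# 3. Only now load the (much larger) chat model into the freed memory.
llm = Llama(model_path="models/llama-3-8b-instruct.Q4_K_M.gguf", n_gpu_layers=-1)
```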
🎛️ Personality Settings
- Temperature (0.1): We keep this low so the AI sticks to the retrieved facts (precision) instead of getting creative (hallucination).
- Stream (True): This prints each token as it is generated, so you don't have to wait for the whole answer before seeing anything.
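For concreteness, here is where those two settings plug into the generation call from the sketch above:

```python
stream = llm(
    prompt,
    temperature=0.1,  # near-greedy sampling keeps the answer anchored to the context
    stream=True,      # emit tokens one at a time instead of one final block
    max_tokens=512,
)
```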
Usage Examples
```bash
python rag_inference.py
```
Related Components
Last Updated: 2026-05-01