Hyperparameters & Configuration

Overview

These settings tune LLM response creativity, generation speed, and VRAM usage. Most parameters are defined in rag_inference.py.

Configuration

LLM Generation Settings

| Parameter | Type | Default | Description |
|---|---|---|---|
| max_tokens | int | None | Maximum output length. None disables the cut-off. |
| temperature | float | 0.1 | Sampling temperature. Lower values give more factual, deterministic output (good for RAG). |
| top_p | float | 0.9 | Nucleus sampling: samples from the smallest token set whose cumulative probability reaches 90%. |
| repeat_penalty | float | 1.1 | Penalizes recently generated tokens to discourage repetition. |
| stream | bool | True | Streams tokens to stdout as they are generated. |
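The generation settings above might be wired up as follows. This is a hypothetical sketch (the actual code lives in rag_inference.py); the `generate` helper and `GENERATION_KWARGS` names are assumptions, and the call shape follows llama-cpp-python's `create_completion` API.

```python
# Generation settings as documented above (sketch; real values live in
# rag_inference.py).
GENERATION_KWARGS = {
    "max_tokens": None,     # None disables the arbitrary output cut-off
    "temperature": 0.1,     # low temperature -> more factual output for RAG
    "top_p": 0.9,           # nucleus sampling over the top-90% probability mass
    "repeat_penalty": 1.1,  # discourage repeated phrases
    "stream": True,         # yield tokens incrementally instead of one blob
}

def generate(llm, prompt):
    """Stream a completion to stdout token by token.

    `llm` is assumed to be a llama_cpp.Llama instance; with stream=True,
    create_completion yields chunks rather than returning a single response.
    """
    for chunk in llm.create_completion(prompt, **GENERATION_KWARGS):
        print(chunk["choices"][0]["text"], end="", flush=True)
    print()
```

With `stream=False` the same call would instead return one completed response dict, at the cost of the user seeing nothing until generation finishes.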

Model Loading Settings (llama_cpp)

| Parameter | Type | Default | Description |
|---|---|---|---|
| n_gpu_layers | int | -1 | Number of layers offloaded to the GPU. -1 offloads all layers. |
| n_ctx | int | 32768 | Context window size, sized to hold Parent chunks. |
| low_vram | bool | True | Reduces the VRAM footprint. |
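The loading settings above could be passed to llama-cpp-python's `Llama` constructor roughly like this. This is a sketch, not the project's actual loader: the `load_model` helper and `MODEL_KWARGS` names are assumptions, and `low_vram` is a llama-cpp-python option that may not exist in all versions.

```python
# Model loading settings as documented above (sketch; real values live in
# rag_inference.py).
MODEL_KWARGS = {
    "n_gpu_layers": -1,  # -1 = offload every layer to the GPU
    "n_ctx": 32768,      # context window large enough for Parent chunks
    "low_vram": True,    # trim VRAM usage (availability is version-dependent)
}

def load_model(model_path):
    """Load a GGUF model with the configured offloading and context settings."""
    # Deferred import: requires the llama-cpp-python package and a local model file.
    from llama_cpp import Llama
    return Llama(model_path=model_path, **MODEL_KWARGS)
```

Setting `n_gpu_layers` to a smaller positive number (e.g. 20) keeps the remaining layers on the CPU, which trades speed for a lower VRAM requirement on smaller GPUs.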

Last Updated: 2026-05-01