Hyperparameters & Configuration
Overview
Settings that control the LLM's response creativity, generation speed, and VRAM usage. Most parameters live in rag_inference.py.
Configuration
LLM Generation Settings
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_tokens | int | None | Maximum tokens to generate. None means generation stops only at EOS or the context limit, with no arbitrary cut-off. |
| temperature | float | 0.1 | Sampling temperature. Lower values give more deterministic, factual output (good for RAG). |
| top_p | float | 0.9 | Nucleus sampling: samples from the smallest set of tokens whose cumulative probability reaches 0.9. |
| repeat_penalty | float | 1.1 | Penalizes tokens that have already appeared, discouraging the model from repeating phrases. |
| stream | bool | True | Streams tokens to stdout as they are generated (matrix-style) instead of waiting for the full response. |
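As a sketch of how these defaults might be wired up in rag_inference.py: the kwargs below mirror the table, while the `llm` handle, the prompt variable, and the commented call are assumptions, not the project's actual code.

```python
# Generation defaults from the table above, collected as keyword arguments.
GENERATION_KWARGS = {
    "max_tokens": None,     # generate until EOS or the context window is exhausted
    "temperature": 0.1,     # low temperature keeps answers close to the retrieved text
    "top_p": 0.9,           # nucleus sampling over the top-90%-probability token set
    "repeat_penalty": 1.1,  # mild penalty on already-emitted tokens to discourage loops
    "stream": True,         # yield tokens as they are generated
}

# Hypothetical usage (llm would be an already-loaded llama_cpp.Llama instance):
# for chunk in llm.create_completion(prompt, **GENERATION_KWARGS):
#     print(chunk["choices"][0]["text"], end="", flush=True)
```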
Model Loading Settings (llama_cpp)
| Parameter | Type | Default | Description |
|---|---|---|---|
| n_gpu_layers | int | -1 | Number of model layers offloaded to the GPU. -1 offloads all layers. |
| n_ctx | int | 32768 | Context window size, sized to feed full Parent chunks. |
| low_vram | bool | True | Reduces the VRAM footprint at some cost to speed. |
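A minimal sketch of loading the model with these settings via llama_cpp. The model path is a placeholder, and the exact constructor call is an assumption based on the library's `Llama` API, not a copy of the project's code.

```python
# Model-loading defaults from the table above.
MODEL_KWARGS = {
    "n_gpu_layers": -1,  # -1 offloads every layer to the GPU
    "n_ctx": 32768,      # context window large enough for full Parent chunks
    "low_vram": True,    # optimize for a smaller VRAM footprint
}

# Hypothetical usage (path is illustrative only):
# from llama_cpp import Llama
# llm = Llama(model_path="models/model.gguf", **MODEL_KWARGS)
```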
Related Components
Last Updated: 2026-05-01