Hyperparameters & Configuration
Overview
Settings that control the LLM's response creativity, generation speed, and VRAM usage. Most parameters live in rag_inference.py.
Configuration
LLM Generation Settings
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_tokens | int | None | Maximum tokens to generate. None means generation stops only at EOS or the context limit, with no arbitrary cut-off. |
| temperature | float | 0.1 | Sampling temperature. Lower values give more deterministic, factual output (good for RAG). |
| top_p | float | 0.9 | Nucleus sampling: samples from the smallest set of tokens whose cumulative probability reaches 0.9. |
| repeat_penalty | float | 1.1 | Penalizes tokens that have already appeared, discouraging the model from repeating phrases. |
| stream | bool | True | Streams tokens to stdout as they are generated (matrix-style) instead of waiting for the full response. |
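As a sketch of how these defaults might be wired up in rag_inference.py: the kwargs below mirror the table, while the `llm` handle, the prompt variable, and the commented call are assumptions, not the project's actual code.

```python
# Generation defaults from the table above, collected as keyword arguments.
GENERATION_KWARGS = {
    "max_tokens": None,     # generate until EOS or the context window is exhausted
    "temperature": 0.1,     # low temperature keeps answers close to the retrieved text
    "top_p": 0.9,           # nucleus sampling over the top-90%-probability token set
    "repeat_penalty": 1.1,  # mild penalty on already-emitted tokens to discourage loops
    "stream": True,         # yield tokens as they are generated
}

# Hypothetical usage (llm would be an already-loaded llama_cpp.Llama instance):
# for chunk in llm.create_completion(prompt, **GENERATION_KWARGS):
#     print(chunk["choices"][0]["text"], end="", flush=True)
```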
Model Loading Settings (llama_cpp)
| Parameter | Type | Default | Description |
|---|---|---|---|
| n_gpu_layers | int | -1 | Number of model layers offloaded to the GPU. -1 offloads all layers. |
| n_ctx | int | 32768 | Context window size, sized to feed full Parent chunks. |
| low_vram | bool | True | Reduces the VRAM footprint at some cost to speed. |
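A minimal sketch of loading the model with these settings via llama_cpp. The model path is a placeholder, and the exact constructor call is an assumption based on the library's `Llama` API, not a copy of the project's code.

```python
# Model-loading defaults from the table above.
MODEL_KWARGS = {
    "n_gpu_layers": -1,  # -1 offloads every layer to the GPU
    "n_ctx": 32768,      # context window large enough for full Parent chunks
    "low_vram": True,    # optimize for a smaller VRAM footprint
}

# Hypothetical usage (path is illustrative only):
# from llama_cpp import Llama
# llm = Llama(model_path="models/model.gguf", **MODEL_KWARGS)
```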
Related Components
Last Updated: 2026-05-01