llama.cpp on Windows
Welcome to the complete guide for setting up and using llama.cpp on a Windows environment. This documentation is designed to take you from zero to running your first Large Language Model (LLM) locally, regardless of whether you are using a CPU, an NVIDIA GPU (CUDA), or an AMD GPU (HIP).
Overview
llama.cpp is a highly optimized C/C++ inference engine for LLaMA and many other LLM architectures. It is designed to run efficiently on consumer hardware by leveraging quantization, which shrinks model weights so that large models fit into limited amounts of RAM and VRAM.
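To get a feel for why quantization matters, the rough arithmetic below estimates the weight-storage footprint of a 7-billion-parameter model at a few common precisions. The bits-per-weight figures are approximate illustrations (quantized formats store per-block scales alongside the weights), not exact GGUF file sizes.

```python
def model_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given precision."""
    return n_params * bits_per_weight / 8 / (1024 ** 3)

n_params = 7e9  # a 7B-parameter model
print(f"F16 : {model_size_gib(n_params, 16):.1f} GiB")   # full 16-bit weights
print(f"Q8_0: {model_size_gib(n_params, 8.5):.1f} GiB")  # ~8.5 bits/weight incl. scales
print(f"Q4_K: {model_size_gib(n_params, 4.5):.1f} GiB")  # ~4.5 bits/weight incl. scales
```

A 4-bit quantization brings a model that would need roughly 13 GiB at F16 down to under 4 GiB, which is the difference between "does not fit" and "runs comfortably" on a typical consumer GPU.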
On Windows, llama.cpp provides a powerful way to:
- Run state-of-the-art LLMs locally.
- Utilize hardware acceleration via NVIDIA CUDA or AMD ROCm/HIP.
- Serve models through a local API server.
- Perform fast inference with minimal overhead.
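As a taste of the API-server workflow, a local serving session can look like the following sketch once llama.cpp is installed (see the installation paths below). The model path is a placeholder; substitute your own GGUF file.

```shell
rem Start the OpenAI-compatible HTTP server on port 8080
rem (the .gguf path is a placeholder -- point it at your own model).
llama-server -m C:\models\my-model.gguf --port 8080

rem In a second terminal: query the chat completions endpoint
curl http://localhost:8080/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d "{\"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}]}"
```

The server exposes an OpenAI-compatible API, so existing client libraries can usually talk to it by pointing their base URL at `http://localhost:8080/v1`.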
Quick Start
Depending on your hardware, follow the relevant installation path:
🛠️ Installation
- Standard (CPU): Install for CPU usage
- NVIDIA GPU: Install with CUDA support
- AMD GPU: Install with AMD HIP support
🚀 Running Models
- Prerequisites: Ensure you have the necessary tools installed: Prerequisites.
- Model Formats: Learn about GGUF and how to get models: Model Formats.
- First Inference: Start your first chat: CLI Usage.
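For orientation, a first interactive chat from the command line can look like this sketch (the model path is a placeholder; the CLI Usage page covers the options in detail):

```shell
rem Interactive chat with a local GGUF model (path is a placeholder).
rem -cnv enables conversation mode; -ngl offloads layers to the GPU
rem on CUDA/HIP builds and is ignored by CPU-only builds.
llama-cli -m C:\models\my-model.gguf -cnv -ngl 99
```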
Documentation Map
For a full view of how this documentation is organized, see the Directory Structure.
Last Updated: 2026-05-03