llama.cpp on Windows

Welcome to the complete guide to setting up and using llama.cpp in a Windows environment. This documentation takes you from zero to running your first Large Language Model (LLM) locally, whether you are using a CPU, an NVIDIA GPU (CUDA), or an AMD GPU (HIP).

Overview

llama.cpp is a highly optimized C/C++ inference engine for the LLaMA family and many other LLM architectures. It is designed to run efficiently on consumer hardware by leveraging quantization, which compresses model weights so that large models fit into smaller amounts of RAM and VRAM.
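
To put rough numbers on that, the sketch below estimates weight memory from a model's parameter count and an approximate bits-per-weight figure for a few common precisions. The 7B parameter count and the bit widths are illustrative assumptions for this example; real GGUF files carry some extra overhead for scales and metadata.

    # Rule of thumb: weight memory ≈ parameter count × bits per weight / 8.
    # The bit widths below are approximations chosen for this example only.

    def approx_weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
        """Approximate weight memory in GiB for a given precision."""
        return n_params * bits_per_weight / 8 / (1024 ** 3)

    if __name__ == "__main__":
        n_params = 7e9  # illustrative 7B-parameter model
        for name, bits in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
            print(f"{name:7s} ~{approx_weight_memory_gib(n_params, bits):.1f} GiB")

Dropping from 16-bit weights to a roughly 4- to 5-bit quantization cuts the weight footprint of the same model by more than two thirds, which is what makes larger models practical on consumer RAM and VRAM.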

On Windows, llama.cpp provides a powerful way to:

  • Run state-of-the-art LLMs locally.
  • Utilize hardware acceleration via NVIDIA CUDA or AMD ROCm/HIP.
  • Serve models through a local API server (a request sketch follows this list).
  • Perform fast inference with minimal overhead.
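
As a minimal illustration of the API-server workflow, the sketch below sends a chat request to a llama-server instance through its OpenAI-compatible /v1/chat/completions endpoint. It assumes the server is already running locally on the default address 127.0.0.1:8080; the prompt and token limit are arbitrary example values.

    # Minimal sketch: query a local llama-server via its
    # OpenAI-compatible chat completions endpoint.
    # Assumes the server is already running on 127.0.0.1:8080 (the default).
    import json
    import urllib.request

    URL = "http://127.0.0.1:8080/v1/chat/completions"

    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Say hello from Windows."},
        ],
        "max_tokens": 64,  # arbitrary example limit
    }

    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)

    # The response follows the OpenAI chat completions schema.
    print(body["choices"][0]["message"]["content"])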

Quick Start

Depending on your hardware, follow the relevant installation path:

🛠️ Installation

🚀 Running Models

  • Prerequisites: Ensure you have the necessary tools installed: Prerequisites.
  • Model Formats: Learn about GGUF and how to get models (a download sketch follows this list): Model Formats.
  • First Inference: Start your first chat: CLI Usage.
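
One common way to get a GGUF model, sketched below, is to download a single file from the Hugging Face Hub with the huggingface_hub package. The repository ID and file name are placeholders, not recommendations; substitute whichever model you want to run.

    # Minimal sketch: fetch one GGUF file from the Hugging Face Hub.
    # Requires: pip install huggingface_hub
    # The repo_id and filename below are placeholders, not real model names.
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(
        repo_id="example-org/example-model-GGUF",  # placeholder repository
        filename="example-model.Q4_K_M.gguf",      # placeholder GGUF file
    )

    print("Model downloaded to:", model_path)
    # Pass this path to llama-cli or llama-server with the -m flag.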

Documentation Map

For a full view of how this documentation is organized, see the Directory Structure.


Last Updated: 2026-05-03
