llama.cpp on Windows

Welcome to the complete guide to setting up and using llama.cpp in a Windows environment. This documentation takes you from zero to running your first Large Language Model (LLM) locally, whether you are using a CPU, an NVIDIA GPU (CUDA), or an AMD GPU (HIP).

Overview

llama.cpp is a highly optimized C/C++ inference engine for the LLaMA family and many other LLM architectures. It is designed to run efficiently on consumer hardware by leveraging quantization, which compresses model weights so that large models fit into smaller amounts of RAM and VRAM.
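
To put rough numbers on that, the sketch below estimates weight memory from a model's parameter count and an approximate bits-per-weight figure for a few common precisions. The 7B parameter count and the bit widths are illustrative assumptions for this example; real GGUF files carry some extra overhead for scales and metadata.

    # Rule of thumb: weight memory ≈ parameter count × bits per weight / 8.
    # The bit widths below are approximations chosen for this example only.

    def approx_weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
        """Approximate weight memory in GiB for a given precision."""
        return n_params * bits_per_weight / 8 / (1024 ** 3)

    if __name__ == "__main__":
        n_params = 7e9  # illustrative 7B-parameter model
        for name, bits in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
            print(f"{name:7s} ~{approx_weight_memory_gib(n_params, bits):.1f} GiB")

Dropping from 16-bit weights to a roughly 4- to 5-bit quantization cuts the weight footprint of the same model by more than two thirds, which is what makes larger models practical on consumer RAM and VRAM.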

On Windows, llama.cpp provides a powerful way to:

  • Run state-of-the-art LLMs locally.
  • Utilize hardware acceleration via NVIDIA CUDA or AMD ROCm/HIP.
  • Serve models through a local API server (a request sketch follows this list).
  • Perform fast inference with minimal overhead.
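
As a minimal illustration of the API-server workflow, the sketch below sends a chat request to a llama-server instance through its OpenAI-compatible /v1/chat/completions endpoint. It assumes the server is already running locally on the default address 127.0.0.1:8080; the prompt and token limit are arbitrary example values.

    # Minimal sketch: query a local llama-server via its
    # OpenAI-compatible chat completions endpoint.
    # Assumes the server is already running on 127.0.0.1:8080 (the default).
    import json
    import urllib.request

    URL = "http://127.0.0.1:8080/v1/chat/completions"

    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Say hello from Windows."},
        ],
        "max_tokens": 64,  # arbitrary example limit
    }

    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)

    # The response follows the OpenAI chat completions schema.
    print(body["choices"][0]["message"]["content"])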

Quick Start

Depending on your hardware, follow the relevant installation path:

🛠️ Installation

🚀 Running Models

  • Prerequisites: Ensure you have the necessary tools installed: Prerequisites.
  • Model Formats: Learn about GGUF and how to get models (a download sketch follows this list): Model Formats.
  • First Inference: Start your first chat: CLI Usage.
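
One common way to get a GGUF model, sketched below, is to download a single file from the Hugging Face Hub with the huggingface_hub package. The repository ID and file name are placeholders, not recommendations; substitute whichever model you want to run.

    # Minimal sketch: fetch one GGUF file from the Hugging Face Hub.
    # Requires: pip install huggingface_hub
    # The repo_id and filename below are placeholders, not real model names.
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(
        repo_id="example-org/example-model-GGUF",  # placeholder repository
        filename="example-model.Q4_K_M.gguf",      # placeholder GGUF file
    )

    print("Model downloaded to:", model_path)
    # Pass this path to llama-cli or llama-server with the -m flag.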

Documentation Map

For a full view of how this documentation is organized, see the Directory Structure.


Last Updated: 2026-05-03
