CLI Usage

Once you have built llama.cpp and downloaded a model, you are ready to start interacting with it. The primary tool for this is llama-cli.exe.

🧠 Key Concepts for Beginners

What is Inference?

Inference is a fancy word for “running the model.” It is the process where you give the AI a prompt (a question or instruction) and it uses its trained knowledge to generate a response.


🚀 Basic Syntax

The simplest way to run a model is to provide the path to the model file and a text prompt.

./bin/Release/llama-cli.exe -m "C:\AI\models\your-model-q4_k_m.gguf" -p "The capital of France is"
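
If your prompt is long, you can keep it in a text file and load it with the -f (--file) flag instead of typing it inline. The prompt file path below is just an example:

# Read the prompt from a text file instead of passing it with -p.
./bin/Release/llama-cli.exe -m "C:\AI\models\your-model-q4_k_m.gguf" -f "C:\AI\prompts\my-prompt.txt"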

🛠️ Essential Flags

To get the most out of your experience, you will frequently use these flags:

| Flag | Long Name | Description |
| --- | --- | --- |
| -m | --model | Required. The full path to your .gguf model file. |
| -p | --prompt | The text prompt you want the AI to respond to. |
| -n | --n-predict | The maximum number of tokens the AI should generate (e.g., -n 512). |
| -t | --threads | The number of CPU threads to use. Usually set to the number of physical cores on your CPU. |
| -ngl | --n-gpu-layers | The magic button for GPU users. Offloads layers of the model from your slower system RAM into your GPU's much faster memory (VRAM). |
| -cnv | --conversation | Enables a chat-like interactive mode. |
| (none) | --temp | Sets the “temperature” (creativity). Higher = more creative/random; lower = more focused/deterministic. |
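
Here is how several of these flags combine in a single run. The model path and values are illustrative; adjust them to your own machine:

# Generate up to 512 tokens on 8 CPU threads, offload 33 layers to the GPU,
# and lower the temperature for a more focused answer.
./bin/Release/llama-cli.exe -m "C:\AI\models\llama3-8b.gguf" -p "Summarize the rules of chess." -n 512 -t 8 -ngl 33 --temp 0.4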

🔍 How to Read the Output

When you run llama-cli.exe, your terminal will show a lot of text. Don’t be intimidated! Here is what to look for:

  1. The Setup Phase: You’ll see lines about “loading model,” “system info,” and “graph computation.” This is just the computer getting ready.
  2. The GPU Check: If you used -ngl, look for lines mentioning “CUDA” or “HIP”. If you see them, your GPU is working!
  3. The Response: After a short pause, the AI will start typing its answer. This is the part you actually care about.
  4. The Stats: Once the AI is done, it prints statistics such as “tokens per second” (how fast it generated) and the prompt eval time (how long it took to process your prompt before it started talking).
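
If you just want to confirm the GPU check from step 2, you can filter the startup logs. This is a convenience trick using Windows' findstr in PowerShell, not a llama.cpp feature, and the exact log wording varies between builds:

# Keep only startup lines mentioning CUDA, HIP, or offloading.
./bin/Release/llama-cli.exe -m "C:\AI\models\llama3-8b.gguf" -p "Hi" -n 8 -ngl 33 2>&1 | findstr /i "CUDA HIP offload"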

🚀 Advanced Usage Examples

1. The “Standard” CPU Run

Best for users without a dedicated GPU.

./bin/Release/llama-cli.exe -m "C:\AI\models\llama3-8b.gguf" -p "Write a poem about coding." -n 256 -t 8
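
Not sure how many physical cores you have? This PowerShell one-liner reports it (NumberOfCores counts physical cores, not hyper-threaded logical processors):

# Query the physical core count in PowerShell.
Get-CimInstance Win32_Processor | Select-Object -ExpandProperty NumberOfCores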

2. The “Fast” GPU Run (NVIDIA/AMD)

This is how you get high speed. Offloading layers to your GPU moves the heavy lifting away from your CPU.

If your GPU has enough VRAM, set -ngl to a high number. An 8B model has roughly 33 offloadable layers, so -ngl 33 covers all of them; -ngl 99 is simply a safe way of saying “offload everything.”

./bin/Release/llama-cli.exe -m "C:\AI\models\llama3-8b.gguf" -p "Explain quantum physics." -ngl 33

Note: If you see “failed to allocate” errors or your computer freezes, your GPU doesn’t have enough VRAM for that many layers. Try a smaller number (e.g., -ngl 10) or a smaller, more heavily quantized model.
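
If you would rather not guess, here is a rough PowerShell sketch that steps the layer count down until a short test run succeeds. It assumes a failed allocation makes llama-cli exit with a nonzero code, and the layer values are arbitrary starting points:

# Try progressively fewer GPU layers until a short test generation succeeds.
foreach ($ngl in 99, 33, 24, 16, 10) {
    ./bin/Release/llama-cli.exe -m "C:\AI\models\llama3-8b.gguf" -p "Hi" -n 8 -ngl $ngl
    if ($LASTEXITCODE -eq 0) { Write-Host "Use -ngl $ngl"; break }  # first count that fits in VRAM
}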

3. Interactive Chat Mode

Instead of a single prompt and exit, this mode lets you have a back-and-forth conversation.

./bin/Release/llama-cli.exe -m "C:\AI\models\llama3-8b.gguf" -cnv -p "You are a helpful assistant."
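
Conversation mode combines well with other flags. For example, --color visually separates your input from the model's replies, and --temp tunes how creative those replies are (values illustrative):

# Chat mode with colored output and a slightly more creative temperature.
./bin/Release/llama-cli.exe -m "C:\AI\models\llama3-8b.gguf" -cnv -p "You are a helpful assistant." --color --temp 0.8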

Last Updated: 2026-05-03