Migrating to llama.cpp | Will Schenk

https://willschenk.com/howto/2026/migrating_to_llama_cpp/ · scraped

Ollama made local LLMs easy, but it comes with real downsides – it's slower than running llama.cpp directly, obscures what you're actually running, locks models into a hashed blob store, and trails upstream on new model support. The good news is that llama.cpp itself has gotten very easy to use.

If you use Ollama, you probably do three things:

1. `ollama run` / `ollama chat` – download a model, chat with it interactively, have it unload when you're done
2. The Ollama API – point tools like Continue, aider, or Open WebUI at `localhost:11434` for an OpenAI-compatible endpoint
3. The Ollama desktop app – a GUI to chat with models

Here's the direct equivalent of each, and then we'll walk through setting it all up.

| Ollama | llama.cpp equivalent |
| --- | --- |
| `ollama run gemma4` | `llama-cli -hf ...:Q4_K_M -cnv` |
| `ollama serve` (API) | `llama-server -hf ...` or llama-swap |
| Ollama desktop app | llama-server web UI at `localhost:8080` |
| `ollama list` | `ls ~/.cache/huggingface/hub/` or `hf cache ls` |
| `ollama pull model` | Automatic on first run with `-hf` |
| Modelfile for parameters | CLI flags (`--temp`, `--ctx-size`, etc.) |
| `~/.ollama/models` (hashed) | `~/.cache/huggingface` (readable dirs) |
| Auto-unload after idle | llama-swap with `ttl` |

On macOS with Homebrew:

```bash
brew install llama.cpp
```

That's it. You get `llama-server`, `llama-cli`, and the rest of the tools. Metal GPU acceleration works out of the box on Apple Silicon.

You can also grab a pre-built binary from the releases page, or build from source:

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
```

We'll use Gemma 4 26B-A4B as our example. It's a Mixture-of-Experts model – 26B total parameters but only 3.8B active per token, so it runs almost as fast as a 4B model with much better quality.

| Model | Total Params | Active Params | Type | Context |
| --- | --- | --- | --- | --- |
| Gemma 4 E2B | 5.1B | 2.3B | Dense | 128K |
| Gemma 4 E4B | 8B | 4.5B | Dense | 128K |
| Gemma 4 26B-A4B | 25.2B | 3.8B | MoE | 256K |
| Gemma 4 31B | 30.7B | 30.7B | Dense | 256K |

### The quantization

The rule of thumb: your model needs to fit in memory with room left over for the KV cache (which stores the conversation context).
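
To make the rule of thumb concrete, here is a rough sketch of the arithmetic. The KV cache estimate uses the textbook formula for plain full attention (one K and one V tensor per layer). The layer count, KV-head count, and head dimension in the demo are made-up illustrative values, not Gemma 4's real architecture, and the function names and the 4 GB headroom default are assumptions for this example only.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_tokens, bytes_per_value=2):
    """Estimate KV cache size in GiB, assuming plain full attention.

    The leading 2 accounts for storing both keys and values per layer;
    bytes_per_value=2 corresponds to a 16-bit cache.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_value
    return total_bytes / 2**30


def fits_in_memory(weights_gb, kv_gb, ram_gb, headroom_gb=4.0):
    """Return True if weights + KV cache + headroom fit in RAM.

    headroom_gb is a hypothetical allowance for the OS and other apps.
    """
    return weights_gb + kv_gb + headroom_gb <= ram_gb


if __name__ == "__main__":
    # Hypothetical architecture numbers, chosen only to illustrate the formula.
    kv = kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128, ctx_tokens=8192)
    print(f"KV cache at 8K context: {kv:.2f} GiB")
    print("17 GB quant on a 64 GB machine:", fits_in_memory(17, kv, 64))
    print("17 GB quant on a 16 GB machine:", fits_in_memory(17, kv, 16))
```

The estimate grows linearly with context length, which is why capping the context is the main lever when memory is tight. Models that use sliding-window attention on some layers need less than this formula suggests, so read the result as a conservative figure rather than an exact one.
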
Head to unsloth/gemma-4-26B-A4B-it-GGUF on Hugging Face to see all the available sizes.

| Quant | Size | Notes |
| --- | --- | --- |
| UD-IQ2_XXS | ~10 GB | Tight on RAM, willing to trade quality |
| UD-Q3_K_M | ~12.5 GB | Good balance for constrained systems |
| UD-Q4_K_M | ~17 GB | Best quality-per-GB sweet spot |
| UD-Q5_K_M | ~21 GB | Noticeably better than Q4 |
| UD-Q6_K | ~23 GB | Diminishing returns vs Q5 |
| Q8_0 | ~27 GB | Near-lossless |
| BF16 | ~50.5 GB | Full precision |

On this M4 Max with 64GB, we can run Q8_0 (27 GB) or even BF16 (50.5 GB) and still have room for the full 256K context window. For most machines, Q4_K_M at ~17 GB is the sweet spot.

Ollama only offers Q4_K_M and Q8_0. Here you get the full range from IQ2 to BF16, quantized by Unsloth with their Dynamic 2.0 method that selectively quantizes different layers to preserve quality.

The context window isn't free – it requires a KV cache in memory on top of the model weights. The bigger the context, the more RAM the KV cache uses. For a single chat session this usually doesn't matter much, but if you're running a server handling multiple requests, or running multiple models via llama-swap, you may want to cap it to leave room.

Use `--ctx-size 0` to get the model's full trained context (256K for Gemma 4 26B-A4B). Or set a specific number if you need to budget memory.

```bash
llama-cli -hf unsloth/gemma-4-26B-A4B-it-GGUF:Q4_K_M -cnv
```

That's the whole thing. What happens:

1. Downloads the model from Hugging Face if you don't have it
2. Loads it into memory with Metal GPU acceleration
3. Starts an interactive chat (`-cnv` is conversation mode)
4. Frees memory immediately when you quit (Ctrl+C)

The chat template is read from the GGUF metadata – no Modelfile, no configuration. Unlike Ollama there's no background daemon; the process runs, you chat, you quit, it's gone.

Want different parameters?
Just add flags:

```bash
llama-cli \
  -hf unsloth/gemma-4-26B-A4B-it-GGUF:Q8_0 \
  --ctx-size 0 \
  --temp 0.7 \
  -cnv
```

Compare this to Ollama where changing temperature means creating a Modelfile, running `ollama create`, and potentially copying 20+ GB of model data.

## 2. API server (ollama serve replacement)

```bash
llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:Q4_K_M --ctx-size 0
```

This starts an OpenAI-compatible API on http://localhost:8080. Point any tool at it – Continue, aider, Open WebUI, or just curl:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4",
    "messages": [
      {"role": "user", "content": "Explain MoE architectures in two sentences"}
    ]
  }'
```

Or with Python:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="gemma-4",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

## 3. Desktop chat UI (Ollama app replacement)

llama-server includes a built-in web UI. Just start the server and open http://localhost:8080 in your browser. You get a chat interface, no extra app to install.

Since llama.cpp stores models in the standard Hugging Face cache, you can just look:

```bash
ls ~/.cache/huggingface/hub/
```

Which gives you readable directory names:

```
models--unsloth--gemma-4-26B-A4B-it-GGUF
models--mlx-community--Qwen3.5-9B-MLX-4bit
```

For more detail, install the hf CLI and run `hf cache ls`, which shows size and last access time:

```
id                                        size   last_accessed  last_modified  refs
model/unsloth/gemma-4-26B-A4B-it-GGUF     18.1G  3 minutes ago  5 minutes ago  ['main']
model/mlx-community/Qwen3.5-9B-MLX-4bit   6.0G   2 weeks ago    2 weeks ago    ['main']
```

A bare llama-server serves one model and runs until you stop it.
If you want Ollama-style behavior where you can hit one endpoint with different model names and have them auto-load and auto-unload, that's what llama-swap is for.

llama-swap is a lightweight Go proxy that sits in front of llama-server. When a request comes in, it looks at the `model` field, starts the right llama-server, and proxies the request. When a request comes in for a different model, it stops the old one and starts the new one. After the `ttl` expires with no requests, the model unloads and memory is freed.

Create a `config.yaml`:

```yaml
models:
  gemma4:
    cmd: llama-server --port ${PORT} -hf unsloth/gemma-4-26B-A4B-it-GGUF:Q4_K_M --ctx-size 0
    ttl: 120
    aliases:
      - gemma-4-26b

  gemma4-31b:
    cmd: llama-server --port ${PORT} -hf unsloth/gemma-4-31B-it-GGUF:Q4_K_M --ctx-size 0
    ttl: 120
    aliases:
      - gemma-4-31b

  gemma4-e2b:
    cmd: llama-server --port ${PORT} -hf unsloth/gemma-4-E2B-it-GGUF:Q4_K_M --ctx-size 0
    ttl: 120

  nemotron:
    cmd: llama-server --port ${PORT} -hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:Q4_K_M --ctx-size 0
    ttl: 120
    aliases:
      - nemotron-3-nano

  nemotron-4b:
    cmd: llama-server --port ${PORT} -hf unsloth/NVIDIA-Nemotron-3-Nano-4B-GGUF:Q8_0 --ctx-size 0
    ttl: 120

  qwen3:
    cmd: llama-server --port ${PORT} -hf unsloth/Qwen3-30B-A3B-GGUF:Q4_K_M --ctx-size 0
    ttl: 120
    aliases:
      - qwen3-30b

  qwen3-32b:
    cmd: llama-server --port ${PORT} -hf unsloth/Qwen3-32B-GGUF:Q4_K_M --ctx-size 0
    ttl: 120

  qwen3-coder:
    cmd: llama-server --port ${PORT} -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M --ctx-size 0
    ttl: 120
    aliases:
      - qwen-coder
```

- `${PORT}`: llama-swap assigns a free port automatically
- `ttl`: seconds of idle time before auto-unloading – 120 means 2 minutes. This is the Ollama auto-unload behavior.
- `aliases`: friendly names for API calls. For example, if you add `gpt-4o-mini` as an alias under `gemma4`, tools that are hard-wired to request `gpt-4o-mini` get routed to your local Gemma.
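
The load, swap, and idle-unload behavior described above can be sketched as a toy state machine. This is only an illustration of the idea, not llama-swap's actual code: the `ToySwap` class and its method names are invented for this example, no processes are started, and the idle check runs lazily on the next request, whereas a real proxy would more likely unload on a timer.

```python
class ToySwap:
    """Toy model of a swap proxy: one model loaded at a time, idle unload via ttl."""

    def __init__(self, models):
        # models mirrors the config shape: {name: {"ttl": seconds, "aliases": [...]}}
        self.models = models
        self.lookup = {}
        for name, spec in models.items():
            self.lookup[name] = name
            for alias in spec.get("aliases", []):
                self.lookup[alias] = name
        self.loaded = None      # canonical name of the loaded model, if any
        self.last_used = None   # timestamp of the most recent request

    def expire(self, now):
        """Unload the current model if it has been idle longer than its ttl."""
        if self.loaded is not None:
            ttl = self.models[self.loaded]["ttl"]
            if now - self.last_used > ttl:
                self.loaded = None

    def request(self, model_field, now):
        """Handle one request and report what happened: 'start', 'swap', or 'reuse'."""
        self.expire(now)
        target = self.lookup[model_field]  # raises KeyError for unknown names
        if self.loaded is None:
            action = "start"
        elif self.loaded != target:
            action = "swap"
        else:
            action = "reuse"
        self.loaded = target
        self.last_used = now
        return action


if __name__ == "__main__":
    swap = ToySwap({
        "gemma4": {"ttl": 120, "aliases": ["gemma-4-26b"]},
        "qwen3-coder": {"ttl": 120, "aliases": ["qwen-coder"]},
    })
    print(swap.request("gemma-4-26b", now=0))     # nothing loaded yet
    print(swap.request("qwen-coder", now=10))     # a different model is requested
    print(swap.request("qwen3-coder", now=20))    # same model, still warm
    print(swap.request("qwen3-coder", now=200))   # idle for longer than ttl
```

The point of the sketch is that aliases are just extra keys pointing at the same entry, and that the cost of a request depends on what is already loaded: only the "start" and "swap" cases pay the model load time.
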
Now it works just like Ollama – one endpoint, multiple models:

```bash
# This starts gemma4 automatically
curl http://localhost:8080/v1/chat/completions \
  -d '{"model": "gemma4", "messages": [{"role": "user", "content": "hi"}]}'

# This stops gemma4 and starts qwen3-coder
curl http://localhost:8080/v1/chat/completions \
  -d '{"model": "qwen3-coder", "messages": [{"role": "user", "content": "write fizzbuzz"}]}'
```

The first request to a model takes a few seconds while the weights load. Subsequent requests skip the load step and start responding right away. After `ttl` with no activity, it unloads.

llama-swap also has a web UI at http://localhost:8080/ui for monitoring running models, viewing token metrics, and manually loading/unloading.

By default llama-swap runs one model at a time. If you have enough RAM, you can define a matrix to run multiple models simultaneously. On a 64GB machine you could comfortably run two Q4 models side by side. See the configuration docs for the matrix DSL.

- Performance: community benchmarks show llama.cpp running 1.5-1.8x faster than Ollama on the same hardware.
- New models immediately: GGUFs appear on Hugging Face within hours of a model release. With Ollama you wait for someone to package it for their registry.
- Full quantization range: Ollama only offers a handful of quant levels. On Hugging Face you get IQ2 through BF16.
- No lock-in: models are plain GGUF files shared with any tool.
- Chat templates just work: llama.cpp reads Jinja templates embedded in the GGUF. No Modelfile, no Go template translation.
- No background daemon: nothing running when you're not using it.
- No VC pivot: llama.cpp is MIT-licensed, community-driven, and now part of the Hugging Face ecosystem.
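
The performance point is easy to check on your own hardware. Below is a minimal sketch using only the standard library; the helper names `tokens_per_second` and `timed_chat` are made up for this example, and it assumes the server's response carries a `usage.completion_tokens` field, as OpenAI-compatible responses normally do. The timing is wall-clock for the whole request, so it includes prompt processing and, on a cold start, model load.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def timed_chat(base_url, model, prompt):
    """POST one chat completion and return (tokens_per_second, reply_text).

    Assumes an OpenAI-compatible server at base_url whose JSON response
    includes usage.completion_tokens.
    """
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    elapsed = time.perf_counter() - start
    tokens = payload["usage"]["completion_tokens"]
    text = payload["choices"][0]["message"]["content"]
    return tokens_per_second(tokens, elapsed), text
```

With a server running, `timed_chat("http://localhost:8080", "gemma4", "write fizzbuzz")` returns the rate and the reply. Send one throwaway request first so the model is already loaded, then compare against the same prompt sent to Ollama's endpoint on `localhost:11434` with the same quantization.
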
