Use when choosing a model to run locally with llama.cpp and GGUF on CPU, Apple Metal, CUDA, or ROCm. Covers finding GGUF repos, choosing a quant, looking up exact GGUF filenames, converting models to GGUF, and running llama-cli or the OpenAI-compatible llama-server.
Search the Hugging Face Hub for llama.cpp-compatible GGUF repos, choose the
right quant, and launch the model with llama-cli or llama-server.
1. Search the Hub with the apps=llama.cpp filter.
2. Open https://huggingface.co/<repo>?local-app=llama.cpp.
3. Confirm exact .gguf filenames with https://huggingface.co/api/models/<repo>/tree/main?recursive=true.
4. Launch with llama-cli -hf <repo>:<QUANT> or llama-server -hf <repo>:<QUANT>.
5. Use --hf-repo plus --hf-file when the repo uses custom file naming.

Install llama.cpp with Homebrew (macOS, Linux):

brew install llama.cpp

Or with winget (Windows):

winget install llama.cpp

Or build from source:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
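cmake -B build
cmake --build build --config Release

CMake writes the binaries (llama-cli, llama-server, llama-quantize) to build/bin.

Log in to Hugging Face to access gated or private repos:

hf auth login

Browse trending llama.cpp-compatible models, or search for a specific family:

https://huggingface.co/models?apps=llama.cpp&sort=trending
https://huggingface.co/models?search=Qwen3.6&apps=llama.cpp&sort=trending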
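
The search URLs above, and the tree API from step 3 of the workflow, follow fixed patterns, so they are easy to script. A minimal bash sketch, assuming bash, grep and sed are available. hub_search_url, tree_url, gguf_paths and model_ggufs are hypothetical helper names. The JSON is scanned with grep rather than parsed, so it assumes the tree API's "path" fields and file paths without quote characters:

```bash
#!/usr/bin/env bash
# Sketch: helpers for the Hub lookups described above.
# hub_search_url, tree_url, gguf_paths and model_ggufs are hypothetical names.

# hub_search_url [term] [max_params]
# Builds a Hub search URL restricted to llama.cpp-compatible repos.
hub_search_url() {
  local term="${1:-}" max_params="${2:-}"
  local url="https://huggingface.co/models?"
  if [ -n "$term" ]; then
    # Only spaces are escaped; other special characters are assumed absent.
    url+="search=${term// /%20}&"
  fi
  url+="apps=llama.cpp"
  if [ -n "$max_params" ]; then
    url+="&num_parameters=min:0,max:${max_params}"
  fi
  printf '%s\n' "${url}&sort=trending"
}

# tree_url <repo>
# Builds the tree-API URL that lists every file on the repo's main branch.
tree_url() {
  printf 'https://huggingface.co/api/models/%s/tree/main?recursive=true\n' "$1"
}

# gguf_paths: reads tree-API JSON on stdin and prints each .gguf path.
# Uses grep/sed instead of a JSON parser, so paths must not contain quotes.
gguf_paths() {
  grep -oE '"path"[[:space:]]*:[[:space:]]*"[^"]*\.gguf"' |
    sed -E 's/^"path"[[:space:]]*:[[:space:]]*"(.*)"$/\1/'
}

# model_ggufs: like gguf_paths, but drops mmproj-* projector weights.
model_ggufs() {
  gguf_paths | grep -vE '(^|/)mmproj-'
}
```

For example, piping curl -s "$(tree_url unsloth/Qwen3.6-35B-A3B-GGUF)" into model_ggufs should list the candidate --hf-file values for that repo, including split files stored in subdirectories.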
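
Add a num_parameters range to cap model size:

https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending

Run a quant straight from the Hub, interactively with llama-cli or as a server with llama-server:

llama-cli -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M

If the repo uses custom file naming, name the exact file instead (-c 4096 sets a 4096-token context):

llama-server \
  --hf-repo unsloth/Qwen3.6-35B-A3B-GGUF \
  --hf-file Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -c 4096

If no GGUF repo exists, download the original weights, convert them with the script from the llama.cpp checkout, then quantize:

hf download <repo-without-gguf> --local-dir ./model-src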
python convert_hf_to_gguf.py ./model-src \
  --outfile model-f16.gguf \
  --outtype f16
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

Start the server, which exposes an OpenAI-compatible API on port 8080 by default:

llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M

Then query it from another shell:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"messages": [
{"role": "user", "content": "Write a limerick about exception handling"}
]
}'

Quant selection:

- Use the exact quant labels shown on the repo's ?local-app=llama.cpp page.
- Keep repo-specific labels such as UD-Q4_K_M as-is instead of normalizing them.
- Default to Q4_K_M unless the repo page or hardware profile suggests otherwise.
- Prefer Q5_K_M or Q6_K for code or technical workloads when memory allows.
- Drop to Q3_K_M, Q4_K_S, or repo-specific IQ / UD-* variants for tighter RAM or VRAM budgets.
- Treat mmproj-*.gguf files as projector weights, not the main checkpoint.
- … imatrix.

References:

- https://github.com/ggml-org/llama.cpp
- https://huggingface.co/docs/hub/gguf-llamacpp
- https://huggingface.co/docs/hub/main/local-apps
- https://huggingface.co/docs/hub/agents-local
- https://huggingface.co/spaces/ggml-org/gguf-my-repo
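
The quant guidance above can be approximated as a lookup over a repo's filenames, for example the output of the tree API. A minimal bash sketch under stated assumptions: pick_quant is a hypothetical helper, and its three profiles (default, code, tight) and their preference orders are one reading of the bullets, not rules from llama.cpp or Hugging Face. IQ variants are not covered.

```bash
#!/usr/bin/env bash
# Sketch: the quant guidance as a filename lookup.
# pick_quant and its profiles/preference orders are assumptions for
# illustration, not llama.cpp or Hugging Face behavior.

# pick_quant <default|code|tight>
# Reads candidate .gguf filenames on stdin (one per line) and prints the first
# one whose name contains a preferred quant label, trying labels in order.
# Matching is a case-insensitive substring test, so repo-native labels such as
# UD-Q4_K_M are returned as-is rather than normalized.
pick_quant() {
  local profile="$1" files label match
  local -a prefs
  case "$profile" in
    code)  prefs=(Q6_K Q5_K_M Q4_K_M) ;;  # more bits for code when memory allows
    tight) prefs=(Q4_K_S Q3_K_M) ;;       # smaller quants for tight RAM/VRAM
    *)     prefs=(Q4_K_M) ;;              # the general default
  esac
  # mmproj-* files are projector weights, never the main checkpoint.
  files=$(grep -vE '(^|/)mmproj-' || true)
  for label in "${prefs[@]}"; do
    match=$(printf '%s\n' "$files" | grep -iF -- "$label" | head -n 1)
    if [ -n "$match" ]; then
      printf '%s\n' "$match"
      return 0
    fi
  done
  return 1  # no preferred quant found; check the repo page manually
}
```

Because the match is a plain substring test, a repo that only ships m-UD-Q4_K_M.gguf still satisfies the default profile, and the returned name keeps the UD- prefix for use with --hf-file.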