Use to select models to run locally with llama.cpp and GGUF on CPU, Mac Metal, CUDA, or ROCm. Covers finding GGUFs, quant selection, running servers, exact GGUF file lookup, conversion, and OpenAI-compatible local serving.
90
93%
Does it follow best practices?
Impact
73%
1.25xAverage score across 3 eval scenarios
Advisory
Suggest reviewing before use
Hub-first quant selection and server setup for code workload
Search with apps=llama.cpp
0%
0%
Local-app page consulted
0%
0%
Tree API used
0%
0%
Code-workload quant
0%
33%
Repo-native labels preserved
80%
80%
llama-server -hf command
100%
100%
OpenAI-compatible curl smoke test
50%
60%
No conversion for GGUF repos
100%
100%
CPU thread flag
50%
37%
HF snippet as source of truth
0%
0%
Memory-constrained model selection with size filter and quant choice
apps=llama.cpp search filter
0%
100%
Parameter size filter
25%
33%
Local-app page opened
0%
100%
Tree API for exact file
0%
100%
Memory-constrained quant selected
66%
66%
Repo-native labels not normalized
100%
100%
llama-cli -hf command
0%
100%
No conversion suggested
100%
100%
URL-first approach
50%
100%
mmproj excluded from main model
100%
100%
Transformers-to-GGUF conversion workflow for repo without GGUF files
hf download command
100%
100%
convert_hf_to_gguf.py used
100%
100%
f16 outtype for initial conversion
100%
100%
llama-quantize for quantization
100%
100%
Appropriate NVIDIA quant
100%
100%
llama-server launch command
100%
100%
GPU layers flag
100%
100%
Smoke test curl
100%
100%
Conversion justified by no GGUF
40%
30%
hf auth login mentioned
100%
62%
0448a7c
Table of Contents
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.