gguf-quantization by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill gguf-quantization
The GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.
Use GGUF when:
Key advantages:
Use alternatives instead:
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Build (CPU)
make
# Build with CUDA (NVIDIA)
make GGML_CUDA=1
# Build with Metal (Apple Silicon)
make GGML_METAL=1
# Install Python bindings (optional)
pip install llama-cpp-python
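If you installed the optional Python bindings, a quick import check confirms the wheel built correctly. This is a minimal sketch; the GPU-offload probe assumes a recent llama-cpp-python release that exposes llama_supports_gpu_offload.
# Sanity-check the llama-cpp-python install
import llama_cpp
print("llama-cpp-python", llama_cpp.__version__)
# Reports whether this build can offload layers to a GPU backend
# (exposed by recent llama-cpp-python versions)
print("GPU offload available:", llama_cpp.llama_supports_gpu_offload())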
# Install requirements
pip install -r requirements.txt
# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf
# Or specify output type
python convert_hf_to_gguf.py ./path/to/model \
--outfile model-f16.gguf \
--outtype f16
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
# Quantize with importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"
# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive
# With GPU offload
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
| Type | Bits | Size (7B) | Quality | Use Case |
|---|---|---|---|---|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | Recommended default |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |
Legacy quantization types (pre-K-quant formats):
| Type | Description |
|---|---|
| Q4_0 | 4-bit, basic |
| Q4_1 | 4-bit with delta |
| Q5_0 | 5-bit, basic |
| Q5_1 | 5-bit with delta |
Recommendation: Use K-quant methods (Q4_K_M, Q5_K_M) for the best quality/size ratio.
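The sizes in the table can be sanity-checked with a rough back-of-the-envelope calculation: file size is roughly parameter count times bits per weight, divided by 8. Real files run slightly larger because embeddings and some tensors stay at higher precision. A minimal sketch, with bits-per-weight figures taken from the table above:
# Rough GGUF size estimate: params * bits-per-weight / 8 (plus some overhead)
def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

for quant, bpw in [("Q4_K_M", 4.5), ("Q5_K_M", 5.5), ("Q8_0", 8.0)]:
    print(f"{quant}: ~{estimate_gguf_size_gb(7e9, bpw):.1f} GB for a 7B model")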
# 1. Download model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
--outfile llama-3.1-8b-f16.gguf \
--outtype f16
# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
# 2. Create calibration text (diverse samples)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
# Add more diverse text samples...
EOF
# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
-f calibration.txt \
--chunk 512 \
-o model.imatrix \
-ngl 35 # GPU layers if available
# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
model-f16.gguf \
model-q4_k_m.gguf \
Q4_K_M
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"
# Generate imatrix once
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35
# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
from llama_cpp import Llama
# Load model
llm = Llama(
model_path="./model-q4_k_m.gguf",
n_ctx=4096, # Context window
n_gpu_layers=35, # GPU offload (0 for CPU only)
n_threads=8 # CPU threads
)
# Generate
output = llm(
"What is machine learning?",
max_tokens=256,
temperature=0.7,
stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
from llama_cpp import Llama
llm = Llama(
model_path="./model-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35,
chat_format="llama-3" # Or "chatml", "mistral", etc.
)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Python?"}
]
response = llm.create_chat_completion(
messages=messages,
max_tokens=256,
temperature=0.7
)
print(response["choices"][0]["message"]["content"])
from llama_cpp import Llama
llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)
# Stream tokens
for chunk in llm(
"Explain quantum computing:",
max_tokens=256,
stream=True
):
print(chunk["choices"][0]["text"], end="", flush=True)
# Start server
./llama-server -m model-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 35 \
-c 4096
# Or with Python bindings
python -m llama_cpp.server \
--model model-q4_k_m.gguf \
--n_gpu_layers 35 \
--host 0.0.0.0 \
--port 8080
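Before pointing clients at the server, you can poll its health endpoint; llama-server exposes /health once the model has finished loading. A minimal sketch using only the standard library (assumes the server started above on port 8080):
import urllib.request
# Returns 200 with a small JSON body once the model is loaded and ready
with urllib.request.urlopen("http://localhost:8080/health") as r:
    print(r.status, r.read().decode())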
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="local-model",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=256
)
print(response.choices[0].message.content)
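The same endpoint also supports streaming, which is what most chat UIs want. A minimal sketch using the same OpenAI client against the llama-server instance started above:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
# Stream tokens from the OpenAI-compatible endpoint
stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
    max_tokens=128,
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)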
# Build with Metal
make clean && make GGML_METAL=1
# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"
# Python with Metal
llm = Llama(
model_path="model.gguf",
n_gpu_layers=99, # Offload all layers
n_threads=1 # Metal handles parallelism
)
# Build with CUDA
make clean && make GGML_CUDA=1
# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"
# Specify GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
# Build with AVX2/AVX512
make clean && make
# Run with optimal threads
./llama-cli -m model.gguf -t 8 -p "Hello"
# Python CPU config
llm = Llama(
model_path="model.gguf",
n_gpu_layers=0, # CPU only
n_threads=8, # Match physical cores
n_batch=512 # Batch size for prompt processing
)
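Picking n_threads: os.cpu_count() reports logical cores, so on machines with SMT/hyper-threading a common rule of thumb is to use the physical core count instead. A minimal sketch; the halving heuristic assumes SMT doubles the logical count:
import os
from llama_cpp import Llama
# os.cpu_count() returns logical cores; assume 2 logical per physical core (SMT)
logical = os.cpu_count() or 8
physical_estimate = max(1, logical // 2)
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,
    n_threads=physical_estimate,  # heuristic: logical cores / 2
    n_batch=512
)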
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
# Create Ollama model
ollama create mymodel -f Modelfile
# Run
ollama run mymodel "Hello!"
# LM Studio: place the GGUF file in ~/.cache/lm-studio/models/
cp model-q4_k_m.gguf text-generation-webui/models/
# Start with llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
Model loads slowly:
# mmap loading is enabled by default; add --mlock to keep the model resident in RAM
./llama-cli -m model.gguf --mlock
Out of memory:
# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20 # Reduce from 35
# Or use smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
Poor quality at low bits:
# Always use imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
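To see how much a given quantization actually degrades output, a quick side-by-side check with greedy decoding makes regressions easy to spot. A minimal sketch, assuming you kept both the Q8_0 and Q4_K_M files produced by the batch-quantization script above:
from llama_cpp import Llama
prompt = "Explain the difference between a list and a tuple in Python."
# Compare a high-quality reference quant against the smaller one with greedy decoding
for path in ["./llama-3.1-8b-q8_0.gguf", "./llama-3.1-8b-q4_k_m.gguf"]:
    llm = Llama(model_path=path, n_ctx=2048, n_gpu_layers=35, verbose=False)
    out = llm(prompt, max_tokens=128, temperature=0.0)
    print(f"\n--- {path} ---\n{out['choices'][0]['text']}")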