gguf-quantization by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill gguf-quantization
The GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.
Use GGUF when:
Key advantages:
Use alternatives instead:
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Build (CPU)
make
# Build with CUDA (NVIDIA)
make GGML_CUDA=1
# Build with Metal (Apple Silicon)
make GGML_METAL=1
# Install Python bindings (optional)
pip install llama-cpp-python
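If you installed the optional Python bindings, a quick import check confirms the wheel built correctly. This is a minimal sketch; the GPU-offload probe assumes a recent llama-cpp-python release that exposes llama_supports_gpu_offload.
# Sanity-check the llama-cpp-python install
import llama_cpp
print("llama-cpp-python", llama_cpp.__version__)
# Reports whether this build can offload layers to a GPU backend
# (exposed by recent llama-cpp-python versions)
print("GPU offload available:", llama_cpp.llama_supports_gpu_offload())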
# Install requirements
pip install -r requirements.txt
# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf
# Or specify output type
python convert_hf_to_gguf.py ./path/to/model \
--outfile model-f16.gguf \
--outtype f16
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
# Quantize with importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"
# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive
# With GPU offload
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
| Type | Bits | Size (7B) | Quality | Use Case |
|---|---|---|---|---|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | Recommended default |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |
Legacy quantization types (pre-K-quant formats):
| Type | Description |
|---|---|
| Q4_0 | 4-bit, basic |
| Q4_1 | 4-bit with delta |
| Q5_0 | 5-bit, basic |
| Q5_1 | 5-bit with delta |
Recommendation: Use K-quant methods (Q4_K_M, Q5_K_M) for the best quality/size ratio.
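The sizes in the table can be sanity-checked with a rough back-of-the-envelope calculation: file size is roughly parameter count times bits per weight, divided by 8. Real files run slightly larger because embeddings and some tensors stay at higher precision. A minimal sketch, with bits-per-weight figures taken from the table above:
# Rough GGUF size estimate: params * bits-per-weight / 8 (plus some overhead)
def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

for quant, bpw in [("Q4_K_M", 4.5), ("Q5_K_M", 5.5), ("Q8_0", 8.0)]:
    print(f"{quant}: ~{estimate_gguf_size_gb(7e9, bpw):.1f} GB for a 7B model")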
# 1. Download model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
--outfile llama-3.1-8b-f16.gguf \
--outtype f16
# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
# 2. Create calibration text (diverse samples)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
# Add more diverse text samples...
EOF
# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
-f calibration.txt \
--chunk 512 \
-o model.imatrix \
-ngl 35 # GPU layers if available
# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
model-f16.gguf \
model-q4_k_m.gguf \
Q4_K_M
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"
# Generate imatrix once
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35
# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
from llama_cpp import Llama
# Load model
llm = Llama(
model_path="./model-q4_k_m.gguf",
n_ctx=4096, # Context window
n_gpu_layers=35, # GPU offload (0 for CPU only)
n_threads=8 # CPU threads
)
# Generate
output = llm(
"What is machine learning?",
max_tokens=256,
temperature=0.7,
stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
from llama_cpp import Llama
llm = Llama(
model_path="./model-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35,
chat_format="llama-3" # Or "chatml", "mistral", etc.
)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Python?"}
]
response = llm.create_chat_completion(
messages=messages,
max_tokens=256,
temperature=0.7
)
print(response["choices"][0]["message"]["content"])
from llama_cpp import Llama
llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)
# Stream tokens
for chunk in llm(
"Explain quantum computing:",
max_tokens=256,
stream=True
):
print(chunk["choices"][0]["text"], end="", flush=True)
# Start server
./llama-server -m model-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 35 \
-c 4096
# Or with Python bindings
python -m llama_cpp.server \
--model model-q4_k_m.gguf \
--n_gpu_layers 35 \
--host 0.0.0.0 \
--port 8080
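Before pointing clients at the server, you can poll its health endpoint; llama-server exposes /health once the model has finished loading. A minimal sketch using only the standard library (assumes the server started above on port 8080):
import urllib.request
# Returns 200 with a small JSON body once the model is loaded and ready
with urllib.request.urlopen("http://localhost:8080/health") as r:
    print(r.status, r.read().decode())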
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="local-model",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=256
)
print(response.choices[0].message.content)
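The same endpoint also supports streaming, which is what most chat UIs want. A minimal sketch using the same OpenAI client against the llama-server instance started above:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
# Stream tokens from the OpenAI-compatible endpoint
stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
    max_tokens=128,
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)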
# Build with Metal
make clean && make GGML_METAL=1
# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"
# Python with Metal
llm = Llama(
model_path="model.gguf",
n_gpu_layers=99, # Offload all layers
n_threads=1 # Metal handles parallelism
)
# Build with CUDA
make clean && make GGML_CUDA=1
# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"
# Specify GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
# Build with AVX2/AVX512
make clean && make
# Run with optimal threads
./llama-cli -m model.gguf -t 8 -p "Hello"
# Python CPU config
llm = Llama(
model_path="model.gguf",
n_gpu_layers=0, # CPU only
n_threads=8, # Match physical cores
n_batch=512 # Batch size for prompt processing
)
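Picking n_threads: os.cpu_count() reports logical cores, so on machines with SMT/hyper-threading a common rule of thumb is to use the physical core count instead. A minimal sketch; the halving heuristic assumes SMT doubles the logical count:
import os
from llama_cpp import Llama
# os.cpu_count() returns logical cores; assume 2 logical per physical core (SMT)
logical = os.cpu_count() or 8
physical_estimate = max(1, logical // 2)
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,
    n_threads=physical_estimate,  # heuristic: logical cores / 2
    n_batch=512
)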
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
# Create Ollama model
ollama create mymodel -f Modelfile
# Run
ollama run mymodel "Hello!"
# LM Studio: place the GGUF file in ~/.cache/lm-studio/models/
cp model-q4_k_m.gguf text-generation-webui/models/
# Start with llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
Model loads slowly:
# mmap loading is enabled by default; add --mlock to keep the model resident in RAM
./llama-cli -m model.gguf --mlock
Out of memory:
# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20 # Reduce from 35
# Or use smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
Poor quality at low bits:
# Always use imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
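To see how much a given quantization actually degrades output, a quick side-by-side check with greedy decoding makes regressions easy to spot. A minimal sketch, assuming you kept both the Q8_0 and Q4_K_M files produced by the batch-quantization script above:
from llama_cpp import Llama
prompt = "Explain the difference between a list and a tuple in Python."
# Compare a high-quality reference quant against the smaller one with greedy decoding
for path in ["./llama-3.1-8b-q8_0.gguf", "./llama-3.1-8b-q4_k_m.gguf"]:
    llm = Llama(model_path=path, n_ctx=2048, n_gpu_layers=35, verbose=False)
    out = llm(prompt, max_tokens=128, temperature=0.0)
    print(f"\n--- {path} ---\n{out['choices'][0]['text']}")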