awq-quantization by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill awq-quantization
4-bit quantization that preserves salient weights based on activation patterns, achieving 3x speedup with minimal accuracy loss.
Use AWQ when:
Use GPTQ instead when:
Use bitsandbytes instead when:
# Default (Triton kernels)
pip install autoawq
# With optimized CUDA kernels + Flash Attention
pip install autoawq[kernels]
# Intel CPU/XPU optimization
pip install autoawq[cpu]
Requirements: Python 3.8+, CUDA 11.8+, Compute Capability 7.5+
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"
model = AutoAWQForCausalLM.from_quantized(
model_name,
fuse_layers=True # Enable fused attention for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Generate
inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "mistralai/Mistral-7B-Instruct-v0.2"
# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Quantization config
quant_config = {
"zero_point": True, # Use zero-point quantization
"q_group_size": 128, # Group size (128 recommended)
"w_bit": 4, # 4-bit weights
"version": "GEMM" # GEMM for batch, GEMV for single-token
}
# Quantize (uses pileval dataset by default)
model.quantize(tokenizer, quant_config=quant_config)
# Save
model.save_quantized("mistral-7b-awq")
tokenizer.save_pretrained("mistral-7b-awq")
Timing: ~10-15 min for a 7B model, ~1 hour for a 70B model.
| Feature | AWQ | GPTQ | bitsandbytes |
|---|---|---|---|
| Speedup (4-bit) | ~2.5-3x | ~2x | ~1.5x |
| Accuracy loss | <5% | ~5-10% | ~5-15% |
| Calibration | Minimal (128-1K tokens) | More extensive | None |
| Overfitting risk | Low | Higher | N/A |
| Best for | Production inference | GPU inference | Easy integration |
| vLLM support | Native | Yes | Limited |
Key insight: AWQ assumes that not all weights are equally important. It protects the ~1% of salient weights identified from activation patterns, reducing quantization error without the overhead of mixed-precision inference.
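The mechanism can be illustrated with a toy sketch in plain NumPy (this is not the AutoAWQ implementation): group-wise zero-point quantization, where scaling salient weights up by `s` before rounding and folding `1/s` into the activations afterwards shrinks their round-trip error. The salience criterion used here (smallest weight magnitude) is a stand-in for illustration; real AWQ derives salience from activation statistics.

```python
import numpy as np

def quantize_group(w, bits=4):
    """Asymmetric (zero-point) round-trip quantization of one weight group."""
    qmax = 2 ** bits - 1
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / qmax or 1.0
    q = np.clip(np.round((w - lo) / scale), 0, qmax)
    return q * scale + lo  # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, 128)

# Pretend the 16 smallest-magnitude weights were flagged as salient
# (stand-in criterion; AWQ uses activation statistics instead).
salient = np.argsort(np.abs(w))[:16]

# Baseline: quantize the group directly.
err_plain = np.abs(quantize_group(w)[salient] - w[salient]).mean()

# AWQ-style: scale the salient weights up before quantizing, then fold
# 1/s back in afterwards. Mathematically a no-op in full precision, but
# it reduces the rounding error of the protected weights.
s = 4.0
w_scaled = w.copy()
w_scaled[salient] *= s
deq = quantize_group(w_scaled)
deq[salient] /= s
err_awq = np.abs(deq[salient] - w[salient]).mean()
```

With the scale folded into the preceding activations, the protected weights occupy finer effective quantization steps, so `err_awq` comes out well below `err_plain` while the group is still stored as plain 4-bit integers.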
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM" # Best for batch sizes > 1
}
quant_config = {
"version": "GEMV" # 20% faster for batch_size=1
}
Limitation: GEMV supports only batch size 1 and is a poor fit for long contexts.
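Assuming the deployment batch size is known up front, the choice between the two kernels can be folded into a small helper (`pick_awq_version` is a hypothetical name, not part of the AutoAWQ API):

```python
def pick_awq_version(expected_batch_size: int, long_context: bool = False) -> str:
    """Pick the AWQ kernel version per the guidance above.
    Hypothetical helper, not part of AutoAWQ."""
    # GEMV is ~20% faster but only handles batch size 1 and short contexts.
    if expected_batch_size == 1 and not long_context:
        return "GEMV"
    return "GEMM"

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": pick_awq_version(expected_batch_size=8),  # -> "GEMM"
}
```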
from transformers import AwqConfig, AutoModelForCausalLM
config = AwqConfig(
bits=4,
version="marlin" # 2x faster on A100/H100
)
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Mistral-7B-AWQ",
quantization_config=config
)
Requirements: Compute Capability 8.0+ (A100, H100, RTX 40xx)
config = AwqConfig(
bits=4,
version="exllama" # Faster prefill, AMD GPU support
)
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/zephyr-7B-alpha-AWQ",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ")
from transformers import AwqConfig, AutoModelForCausalLM
config = AwqConfig(
bits=4,
fuse_max_seq_len=512, # Max sequence length for fusing
do_fuse=True # Enable fused attention/MLP
)
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Mistral-7B-OpenOrca-AWQ",
quantization_config=config
)
Note: Fused modules cannot be combined with FlashAttention 2.
from vllm import LLM, SamplingParams
# vLLM auto-detects AWQ models
llm = LLM(
model="TheBloke/Llama-2-7B-AWQ",
quantization="awq",
dtype="half"
)
sampling = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["Explain AI"], sampling)
| Model | FP16 | AWQ 4-bit | Reduction |
|---|---|---|---|
| Mistral 7B | 14 GB | 5.5 GB | 2.5x |
| Llama 2-13B | 26 GB | 10 GB | 2.6x |
| Llama 2-70B | 140 GB | 35 GB | 4x |
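As a back-of-envelope check on the table above, a floor estimate of AWQ checkpoint size: packed 4-bit weights plus one FP16 scale and one FP16 zero-point per group. Layers typically kept in FP16 (embeddings, `lm_head`) are ignored, which is why published checkpoints come out larger than this floor.

```python
def awq_size_gb(n_params: float, w_bit: int = 4, group_size: int = 128) -> float:
    """Floor estimate of AWQ checkpoint size in GB: packed w_bit weights
    plus 2 bytes of scale and 2 bytes of zero-point per group."""
    bytes_per_weight = w_bit / 8 + (2 + 2) / group_size
    return n_params * bytes_per_weight / 1e9

fp16_gb = 7e9 * 2 / 1e9      # 14.0 GB, matching the table's 7B row
floor_gb = awq_size_gb(7e9)  # ~3.7 GB floor; the published 5.5 GB also
                             # carries the layers left in FP16
```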
| Model | Prefill (tok/s) | Decode (tok/s) | Memory |
|---|---|---|---|
| Mistral 7B GEMM | 3,897 | 114 | 5.55 GB |
| TinyLlama 1B GEMV | 5,179 | 431 | 2.10 GB |
| Llama 2-13B GEMM | 2,279 | 74 | 10.28 GB |
| Model | FP16 | AWQ 4-bit | Degradation |
|---|---|---|---|
| Llama 3 8B | 8.20 | 8.48 | +3.4% |
| Mistral 7B | 5.25 | 5.42 | +3.2% |
| Qwen2 72B | 4.85 | 4.95 | +2.1% |
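The degradation column is simply the relative perplexity increase, which can be reproduced from the FP16 and AWQ columns:

```python
def ppl_degradation(fp16_ppl: float, awq_ppl: float) -> float:
    """Relative perplexity increase in percent (lower is better)."""
    return round((awq_ppl - fp16_ppl) / fp16_ppl * 100, 1)

ppl_degradation(8.20, 8.48)  # Llama 3 8B -> 3.4
ppl_degradation(5.25, 5.42)  # Mistral 7B -> 3.2
ppl_degradation(4.85, 4.95)  # Qwen2 72B  -> 2.1
```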
# Use custom dataset for domain-specific models
model.quantize(
tokenizer,
quant_config=quant_config,
calib_data="wikitext", # Or custom list of strings
max_calib_samples=256, # More samples = better accuracy
max_calib_seq_len=512 # Sequence length
)
# Or provide your own samples
calib_samples = [
"Your domain-specific text here...",
"More examples from your use case...",
]
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)
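Before passing domain text as `calib_data`, a light cleanup pass avoids wasting calibration samples on duplicates or near-empty strings. `prepare_calib_samples` is a hypothetical helper, not part of AutoAWQ; `calib_data` only needs a plain list of strings.

```python
def prepare_calib_samples(texts, max_samples=256, min_words=8):
    """Deduplicate and filter raw domain text into calibration samples.
    Hypothetical preprocessing; AutoAWQ just takes a list of strings."""
    seen, samples = set(), []
    for text in texts:
        text = text.strip()
        if len(text.split()) < min_words or text in seen:
            continue  # skip near-empty and duplicate samples
        seen.add(text)
        samples.append(text)
        if len(samples) >= max_samples:
            break
    return samples

docs = [
    "Too short.",
    "A representative paragraph from your domain corpus goes here, "
    "ideally a few hundred tokens long.",
    "A representative paragraph from your domain corpus goes here, "
    "ideally a few hundred tokens long.",  # duplicate, dropped
]
calib_samples = prepare_calib_samples(docs)  # one sample survives
```

The returned list can then be passed directly as `calib_data` in the `model.quantize(...)` call above.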
model = AutoAWQForCausalLM.from_quantized(
"TheBloke/Llama-2-70B-AWQ",
device_map="auto", # Auto-split across GPUs
max_memory={0: "40GB", 1: "40GB"}
)
Supports 35+ architectures, including:
CUDA OOM during quantization:
# Reduce the number of calibration samples
model.quantize(tokenizer, quant_config=quant_config, max_calib_samples=64)
Slow inference:
# Enable fused layers
model = AutoAWQForCausalLM.from_quantized(model_name, fuse_layers=True)
AMD GPU support:
# Use the ExLlama backend
config = AwqConfig(bits=4, version="exllama")
AutoAWQ is officially deprecated. For new projects, consider:
Existing quantized models remain usable.
Weekly installs: 156
GitHub stars: 23.4K
First seen: Jan 21, 2026
Security audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on: claude-code (129), opencode (126), gemini-cli (120), cursor (120), codex (109), antigravity (105)