awq-quantization by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill awq-quantization
4-bit quantization that preserves salient weights based on activation patterns, achieving 3x speedup with minimal accuracy loss.
Use AWQ when:
Use GPTQ instead when:
Use bitsandbytes instead when:
# Default (Triton kernels)
pip install autoawq
# With optimized CUDA kernels + Flash Attention
pip install autoawq[kernels]
# Intel CPU/XPU optimization
pip install autoawq[cpu]
Requirements: Python 3.8+, CUDA 11.8+, Compute Capability 7.5+
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"
model = AutoAWQForCausalLM.from_quantized(
model_name,
fuse_layers=True # Enable fused attention for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Generate
inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "mistralai/Mistral-7B-Instruct-v0.2"
# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Quantization config
quant_config = {
"zero_point": True, # Use zero-point quantization
"q_group_size": 128, # Group size (128 recommended)
"w_bit": 4, # 4-bit weights
"version": "GEMM" # GEMM for batch, GEMV for single-token
}
# Quantize (uses pileval dataset by default)
model.quantize(tokenizer, quant_config=quant_config)
# Save
model.save_quantized("mistral-7b-awq")
tokenizer.save_pretrained("mistral-7b-awq")
Timing: ~10-15 min for a 7B model, ~1 hour for a 70B model.
| Feature | AWQ | GPTQ | bitsandbytes |
|---|---|---|---|
| Speedup (4-bit) | ~2.5-3x | ~2x | ~1.5x |
| Accuracy loss | <5% | ~5-10% | ~5-15% |
| Calibration | Minimal (128-1K tokens) | More extensive | None |
| Overfitting risk | Low | Higher | N/A |
| Best for | Production inference | GPU inference | Easy integration |
| vLLM support | Native | Yes | Limited |
Key insight: AWQ assumes that not all weights are equally important. It protects the ~1% of salient weights identified from activation patterns, reducing quantization error without the overhead of mixed-precision inference.
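The mechanism can be illustrated with a toy sketch in plain NumPy (this is not the AutoAWQ implementation): group-wise zero-point quantization, where scaling salient weights up by `s` before rounding and folding `1/s` into the activations afterwards shrinks their round-trip error. The salience criterion used here (smallest weight magnitude) is a stand-in for illustration; real AWQ derives salience from activation statistics.

```python
import numpy as np

def quantize_group(w, bits=4):
    """Asymmetric (zero-point) round-trip quantization of one weight group."""
    qmax = 2 ** bits - 1
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / qmax or 1.0
    q = np.clip(np.round((w - lo) / scale), 0, qmax)
    return q * scale + lo  # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, 128)

# Pretend the 16 smallest-magnitude weights were flagged as salient
# (stand-in criterion; AWQ uses activation statistics instead).
salient = np.argsort(np.abs(w))[:16]

# Baseline: quantize the group directly.
err_plain = np.abs(quantize_group(w)[salient] - w[salient]).mean()

# AWQ-style: scale the salient weights up before quantizing, then fold
# 1/s back in afterwards. Mathematically a no-op in full precision, but
# it reduces the rounding error of the protected weights.
s = 4.0
w_scaled = w.copy()
w_scaled[salient] *= s
deq = quantize_group(w_scaled)
deq[salient] /= s
err_awq = np.abs(deq[salient] - w[salient]).mean()
```

With the scale folded into the preceding activations, the protected weights occupy finer effective quantization steps, so `err_awq` comes out well below `err_plain` while the group is still stored as plain 4-bit integers.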
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM" # Best for batch sizes > 1
}
quant_config = {
"version": "GEMV" # 20% faster for batch_size=1
}
Limitation: GEMV supports only batch size 1 and is a poor fit for long contexts.
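Assuming the deployment batch size is known up front, the choice between the two kernels can be folded into a small helper (`pick_awq_version` is a hypothetical name, not part of the AutoAWQ API):

```python
def pick_awq_version(expected_batch_size: int, long_context: bool = False) -> str:
    """Pick the AWQ kernel version per the guidance above.
    Hypothetical helper, not part of AutoAWQ."""
    # GEMV is ~20% faster but only handles batch size 1 and short contexts.
    if expected_batch_size == 1 and not long_context:
        return "GEMV"
    return "GEMM"

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": pick_awq_version(expected_batch_size=8),  # -> "GEMM"
}
```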
from transformers import AwqConfig, AutoModelForCausalLM
config = AwqConfig(
bits=4,
version="marlin" # 2x faster on A100/H100
)
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Mistral-7B-AWQ",
quantization_config=config
)
Requirements: Compute Capability 8.0+ (A100, H100, RTX 40xx)
config = AwqConfig(
bits=4,
version="exllama" # Faster prefill, AMD GPU support
)
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/zephyr-7B-alpha-AWQ",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ")
from transformers import AwqConfig, AutoModelForCausalLM
config = AwqConfig(
bits=4,
fuse_max_seq_len=512, # Max sequence length for fusing
do_fuse=True # Enable fused attention/MLP
)
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Mistral-7B-OpenOrca-AWQ",
quantization_config=config
)
Note: Fused modules cannot be combined with FlashAttention 2.
from vllm import LLM, SamplingParams
# vLLM auto-detects AWQ models
llm = LLM(
model="TheBloke/Llama-2-7B-AWQ",
quantization="awq",
dtype="half"
)
sampling = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["Explain AI"], sampling)
| Model | FP16 | AWQ 4-bit | Reduction |
|---|---|---|---|
| Mistral 7B | 14 GB | 5.5 GB | 2.5x |
| Llama 2-13B | 26 GB | 10 GB | 2.6x |
| Llama 2-70B | 140 GB | 35 GB | 4x |
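As a back-of-envelope check on the table above, a floor estimate of AWQ checkpoint size: packed 4-bit weights plus one FP16 scale and one FP16 zero-point per group. Layers typically kept in FP16 (embeddings, `lm_head`) are ignored, which is why published checkpoints come out larger than this floor.

```python
def awq_size_gb(n_params: float, w_bit: int = 4, group_size: int = 128) -> float:
    """Floor estimate of AWQ checkpoint size in GB: packed w_bit weights
    plus 2 bytes of scale and 2 bytes of zero-point per group."""
    bytes_per_weight = w_bit / 8 + (2 + 2) / group_size
    return n_params * bytes_per_weight / 1e9

fp16_gb = 7e9 * 2 / 1e9      # 14.0 GB, matching the table's 7B row
floor_gb = awq_size_gb(7e9)  # ~3.7 GB floor; the published 5.5 GB also
                             # carries the layers left in FP16
```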
| Model | Prefill (tok/s) | Decode (tok/s) | Memory |
|---|---|---|---|
| Mistral 7B GEMM | 3,897 | 114 | 5.55 GB |
| TinyLlama 1B GEMV | 5,179 | 431 | 2.10 GB |
| Llama 2-13B GEMM | 2,279 | 74 | 10.28 GB |
| Model | FP16 | AWQ 4-bit | Degradation |
|---|---|---|---|
| Llama 3 8B | 8.20 | 8.48 | +3.4% |
| Mistral 7B | 5.25 | 5.42 | +3.2% |
| Qwen2 72B | 4.85 | 4.95 | +2.1% |
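The degradation column is simply the relative perplexity increase, which can be reproduced from the FP16 and AWQ columns:

```python
def ppl_degradation(fp16_ppl: float, awq_ppl: float) -> float:
    """Relative perplexity increase in percent (lower is better)."""
    return round((awq_ppl - fp16_ppl) / fp16_ppl * 100, 1)

ppl_degradation(8.20, 8.48)  # Llama 3 8B -> 3.4
ppl_degradation(5.25, 5.42)  # Mistral 7B -> 3.2
ppl_degradation(4.85, 4.95)  # Qwen2 72B  -> 2.1
```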
# Use custom dataset for domain-specific models
model.quantize(
tokenizer,
quant_config=quant_config,
calib_data="wikitext", # Or custom list of strings
max_calib_samples=256, # More samples = better accuracy
max_calib_seq_len=512 # Sequence length
)
# Or provide your own samples
calib_samples = [
"Your domain-specific text here...",
"More examples from your use case...",
]
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)
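Before passing domain text as `calib_data`, a light cleanup pass avoids wasting calibration samples on duplicates or near-empty strings. `prepare_calib_samples` is a hypothetical helper, not part of AutoAWQ; `calib_data` only needs a plain list of strings.

```python
def prepare_calib_samples(texts, max_samples=256, min_words=8):
    """Deduplicate and filter raw domain text into calibration samples.
    Hypothetical preprocessing; AutoAWQ just takes a list of strings."""
    seen, samples = set(), []
    for text in texts:
        text = text.strip()
        if len(text.split()) < min_words or text in seen:
            continue  # skip near-empty and duplicate samples
        seen.add(text)
        samples.append(text)
        if len(samples) >= max_samples:
            break
    return samples

docs = [
    "Too short.",
    "A representative paragraph from your domain corpus goes here, "
    "ideally a few hundred tokens long.",
    "A representative paragraph from your domain corpus goes here, "
    "ideally a few hundred tokens long.",  # duplicate, dropped
]
calib_samples = prepare_calib_samples(docs)  # one sample survives
```

The returned list can then be passed directly as `calib_data` in the `model.quantize(...)` call above.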
model = AutoAWQForCausalLM.from_quantized(
"TheBloke/Llama-2-70B-AWQ",
device_map="auto", # Auto-split across GPUs
max_memory={0: "40GB", 1: "40GB"}
)
Supports 35+ architectures, including:
CUDA OOM during quantization:
# Reduce the number of calibration samples
model.quantize(tokenizer, quant_config=quant_config, max_calib_samples=64)
Slow inference:
# Enable fused layers
model = AutoAWQForCausalLM.from_quantized(model_name, fuse_layers=True)
AMD GPU support:
# Use the ExLlama backend
config = AwqConfig(bits=4, version="exllama")
AutoAWQ is officially deprecated. For new projects, consider:
Existing quantized models remain usable.
Weekly installs: 156
GitHub stars: 23.4K
First seen: Jan 21, 2026
Security audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on: claude-code (129), opencode (126), gemini-cli (120), cursor (120), codex (109), antigravity (105)