quantizing-models-bitsandbytes by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill quantizing-models-bitsandbytes
bitsandbytes reduces LLM weight memory by roughly 50% (8-bit) or 75% (4-bit) relative to FP16, typically with a small accuracy loss (around 1% or less in many cases).
Installation:
pip install bitsandbytes transformers accelerate
8-bit quantization (50% memory reduction):
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto",
)
# Weight memory: ~14 GB (FP16) → ~7 GB
4-bit quantization (75% memory reduction):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto",
)
# Weight memory: ~14 GB (FP16) → ~3.5 GB
Copy this checklist:
Quantization Loading:
- [ ] Step 1: Calculate memory requirements
- [ ] Step 2: Choose quantization level (4-bit or 8-bit)
- [ ] Step 3: Configure quantization
- [ ] Step 4: Load and verify model
Step 1: Calculate memory requirements
Estimate model weight memory (activations, the KV cache, and quantization constants add to these figures):
FP16 memory (GB) = Parameters × 2 bytes / 1e9
INT8 memory (GB) = Parameters × 1 byte / 1e9
INT4 memory (GB) = Parameters × 0.5 bytes / 1e9
Example (Llama 2 7B):
FP16: 7B × 2 / 1e9 = 14 GB
INT8: 7B × 1 / 1e9 = 7 GB
INT4: 7B × 0.5 / 1e9 = 3.5 GB
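The arithmetic above can be wrapped in a small helper. This is a minimal sketch: `weight_memory_gb` is a hypothetical name, and the result covers weights only, using the nominal 7B parameter count from the example.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9


# Nominal parameter count used in the Llama 2 7B example
llama2_7b_params = 7e9

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_memory_gb(llama2_7b_params, bits):.1f} GB")
```

Real usage will be somewhat higher, since this ignores activations and any per-block quantization constants.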
Step 2: Choose quantization level
| GPU VRAM | Model Size | Recommended |
|---|---|---|
| 8 GB | 3B | 4-bit |
| 12 GB | 7B | 4-bit |
| 16 GB | 7B | 8-bit or 4-bit |
| 24 GB | 13B | 8-bit or 4-bit |
| 40+ GB | 70B | 4-bit (about 35 GB of weights) |
| 80 GB | 70B | 8-bit (about 70 GB of weights) or 4-bit |
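One way to sanity-check a row in a table like this is to compare the estimated weight memory against the available VRAM. The sketch below assumes a hypothetical `weights_fit` helper and an arbitrary 2 GB allowance for activations and CUDA overhead; the right margin depends on batch size and sequence length.

```python
def weights_fit(vram_gb: float, num_params: float, bits: int,
                headroom_gb: float = 2.0) -> bool:
    """Return True if quantized weights plus a fixed headroom fit in VRAM.

    headroom_gb is an assumed allowance for activations and runtime overhead.
    """
    weight_gb = num_params * (bits / 8) / 1e9
    return weight_gb + headroom_gb <= vram_gb


# 13B at 8-bit is about 13 GB of weights
print(weights_fit(24, 13e9, bits=8))
# 70B at 4-bit is about 35 GB of weights
print(weights_fit(24, 70e9, bits=4))
print(weights_fit(48, 70e9, bits=4))
```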
Step 3: Configure quantization
For 8-bit (better accuracy):
from transformers import BitsAndBytesConfig
import torch

config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # Outlier threshold
    llm_int8_has_fp16_weight=False,
)
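To see why an outlier threshold matters, here is a simplified, pure-Python sketch of absmax INT8 quantization with and without outliers set aside. It is an illustration only, not the bitsandbytes kernel: the helper names are hypothetical, and it operates on one small list rather than on activation matrices.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(q_values, scale):
    """Recover approximate floats from quantized integers."""
    return [q * scale for q in q_values]


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


row = [0.5, -1.2, 0.03, 40.0, 2.0]
threshold = 6.0

# Quantize everything with one scale: the outlier dominates the range.
q_all, s_all = quantize_absmax_int8(row)
err_all = max_abs_error(row, dequantize(q_all, s_all))

# Keep values above the threshold in full precision, quantize the rest.
outliers = {i: v for i, v in enumerate(row) if abs(v) > threshold}
regular = [0.0 if i in outliers else v for i, v in enumerate(row)]
q_reg, s_reg = quantize_absmax_int8(regular)
restored = dequantize(q_reg, s_reg)
for i, v in outliers.items():
    restored[i] = v
err_split = max_abs_error(row, restored)

print(f"max error, single scale:   {err_all:.4f}")
print(f"max error, outliers split: {err_split:.4f}")
```

With the large value handled separately, the remaining entries share a much finer scale, so their rounding error shrinks.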
For 4-bit (maximum memory savings):
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # Compute in FP16
    bnb_4bit_quant_type="nf4",  # NormalFloat4 (recommended)
    bnb_4bit_use_double_quant=True,  # Nested quantization
)
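Nested (double) quantization reduces the cost of storing per-block scaling constants. The sketch below works through that overhead using the block sizes described in the QLoRA paper (64 weights per block, 256 blocks per second-level group); treating these as the defaults of any particular bitsandbytes version is an assumption, and `constant_overhead_bits` is a hypothetical name.

```python
def constant_overhead_bits(block_size: int = 64, double_quant: bool = False,
                           dq_block_size: int = 256) -> float:
    """Per-parameter overhead, in bits, of storing quantization constants."""
    if not double_quant:
        # One FP32 scaling constant per block of weights
        return 32 / block_size
    # Constants quantized to 8 bits, plus one FP32 constant per group of blocks
    return 8 / block_size + 32 / (block_size * dq_block_size)


plain = constant_overhead_bits()
nested = constant_overhead_bits(double_quant=True)
saved_bits = plain - nested

# Translate the per-parameter saving into GB for a nominal 7B-parameter model
saved_gb = 7e9 * saved_bits / 8 / 1e9
print(f"overhead without nesting: {plain:.3f} bits/param")
print(f"overhead with nesting:    {nested:.3f} bits/param")
print(f"saving on 7B params:      {saved_gb:.2f} GB")
```

The saving is memory, not accuracy: under these assumptions it is about 0.37 bits per parameter.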
Step 4: Load and verify model
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=config,
    device_map="auto",  # Automatic device placement
    torch_dtype=torch.float16,  # dtype for the non-quantized modules
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

# Test inference (send inputs to the device that holds the first layers)
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Check memory
print(f"Memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
QLoRA enables fine-tuning large models on consumer GPUs by training small LoRA adapters on top of a frozen 4-bit base model.
Copy this checklist:
QLoRA Fine-tuning:
- [ ] Step 1: Install dependencies
- [ ] Step 2: Configure 4-bit base model
- [ ] Step 3: Add LoRA adapters
- [ ] Step 4: Train with standard Trainer
Step 1: Install dependencies
pip install bitsandbytes transformers peft accelerate datasets
Step 2: Configure 4-bit base model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
Step 3: Add LoRA adapters
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare model for training
model = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    r=16,  # LoRA rank
    lora_alpha=32,  # LoRA alpha (scaling factor)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Add LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints approximately: trainable params: ~16.8M || all params: ~6.76B || trainable%: ~0.25
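The trainable parameter count can be estimated by hand: each adapted linear layer adds two low-rank matrices with `r * (in_features + out_features)` parameters in total. The sketch below assumes the Llama 2 7B shape (32 layers, 4096 x 4096 attention projections); `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(rank: int, in_features: int, out_features: int) -> int:
    """Parameters in one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


# Assumed Llama 2 7B dimensions
hidden_size = 4096
num_layers = 32

# r=16 on all four attention projections, as configured above
per_layer = 4 * lora_param_count(16, hidden_size, hidden_size)
total_r16_qkvo = num_layers * per_layer

# For comparison: r=8 on two projections only
total_r8_two = num_layers * 2 * lora_param_count(8, hidden_size, hidden_size)

print(f"r=16, four projections: {total_r16_qkvo:,}")
print(f"r=8, two projections:   {total_r8_two:,}")
```

Under these assumptions, r=16 on four projections gives about 16.8M trainable parameters, while a figure of about 4.2M corresponds to r=8 on two projections.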
Step 4: Train with standard Trainer
from transformers import Trainer, TrainingArguments

# Assumes `train_dataset` (already tokenized) and `tokenizer` are defined elsewhere
training_args = TrainingArguments(
    output_dir="./qlora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # newer transformers releases name this argument processing_class
)
trainer.train()

# Save only the LoRA adapters (tens of MB, far smaller than the base model)
model.save_pretrained("./qlora-adapters")
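Two numbers are worth checking before training with the settings above: the effective batch size and the expected adapter size on disk. This is a rough sketch with hypothetical helper names. It assumes a single GPU, adapters stored in FP32, and the 16,777,216-parameter count that r=16 on four 4096 x 4096 projections across 32 layers would give.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Examples contributing to each optimizer update."""
    return per_device * grad_accum * num_devices


def adapter_size_mb(trainable_params: int, bytes_per_param: int = 4) -> float:
    """Approximate adapter checkpoint size in MB (1 MB = 1e6 bytes)."""
    return trainable_params * bytes_per_param / 1e6


batch = effective_batch_size(per_device=4, grad_accum=4)
size_fp32 = adapter_size_mb(16_777_216)                      # assumed FP32 storage
size_fp16 = adapter_size_mb(16_777_216, bytes_per_param=2)   # if stored in FP16

print(f"effective batch size: {batch}")
print(f"adapter size: {size_fp32:.1f} MB (FP32), {size_fp16:.1f} MB (FP16)")
```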
Use 8-bit Adam/AdamW to reduce optimizer memory by 75%.
8-bit Optimizer Setup:
- [ ] Step 1: Replace standard optimizer
- [ ] Step 2: Configure training
- [ ] Step 3: Monitor memory savings
Step 1: Replace standard optimizer
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Instead of torch.optim.AdamW, let Trainer build a bitsandbytes optimizer.
# Assumes `train_dataset` is defined elsewhere; "model-name" is a placeholder.
model = AutoModelForCausalLM.from_pretrained("model-name")
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=8,
    optim="paged_adamw_8bit",  # Paged 8-bit AdamW from bitsandbytes
    learning_rate=5e-5,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
Manual optimizer usage:
import bitsandbytes as bnb

optimizer = bnb.optim.AdamW8bit(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    eps=1e-8,
)

# Training loop
for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
Step 2: Configure training
Compare memory:
Standard AdamW optimizer memory = model_params × 8 bytes (two FP32 states × 4 bytes each)
8-bit AdamW memory = model_params × 2 bytes (two 8-bit states × 1 byte each)
Savings = 75% optimizer memory
Example (Llama 2 7B, full fine-tuning):
Standard: 7B × 8 bytes = 56 GB
8-bit: 7B × 2 bytes = 14 GB
Savings: 42 GB
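The comparison above reduces to one formula. The sketch below uses a hypothetical `adam_state_memory_gb` helper and ignores the small bookkeeping overhead that 8-bit optimizers keep for their quantization statistics.

```python
def adam_state_memory_gb(num_trainable_params: float, bytes_per_state: int,
                         num_states: int = 2) -> float:
    """Memory for Adam-style optimizer states (first and second moments), in GB."""
    return num_trainable_params * bytes_per_state * num_states / 1e9


full_32bit = adam_state_memory_gb(7e9, bytes_per_state=4)
full_8bit = adam_state_memory_gb(7e9, bytes_per_state=1)
print(f"32-bit AdamW: {full_32bit:.0f} GB, 8-bit AdamW: {full_8bit:.0f} GB")
print(f"saved: {full_32bit - full_8bit:.0f} GB ({1 - full_8bit / full_32bit:.0%})")

# Only trainable parameters carry optimizer state. With LoRA adapters
# (about 16.8M trainable parameters is assumed here) the state is tiny.
lora_32bit = adam_state_memory_gb(16_777_216, bytes_per_state=4)
print(f"32-bit AdamW on LoRA adapters: {lora_32bit:.3f} GB")
```

The 56 GB figure applies to full fine-tuning; when only adapters are trained, the optimizer state is a minor part of the memory budget.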
Step 3: Monitor memory savings
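import torch

# Assumes `optimizer` is defined and loss.backward() has already run
before = torch.cuda.memory_allocated()
# First training step: optimizer states are typically allocated lazily here
optimizer.step()
after = torch.cuda.memory_allocated()
print(f"Memory used: {(after - before) / 1e9:.2f} GB")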
Use bitsandbytes when:
Use alternatives instead:
Issue: CUDA error during loading
Check the installed CUDA version, then install bitsandbytes without using pip's cache:
# Check CUDA version
nvcc --version
# Install bitsandbytes, bypassing cached wheels
pip install bitsandbytes --no-cache-dir
Issue: Model loading slow
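Cap per-device memory so that layers which do not fit on the GPU are offloaded to CPU:
# Note: dispatching quantized modules to CPU may also require
# llm_int8_enable_fp32_cpu_offload=True in the BitsAndBytesConfig
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=config,
    device_map="auto",
    max_memory={0: "20GB", "cpu": "30GB"},  # Offload to CPU
)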
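A `max_memory` mapping like the one above can be built programmatically. This is a minimal sketch: `build_max_memory` is a hypothetical helper, the 10% headroom is an arbitrary assumption, and the sizes are passed in by hand rather than queried from the hardware.

```python
def build_max_memory(gpu_total_gib, cpu_gib, headroom=0.1):
    """Build a max_memory dict, leaving a fraction of each GPU unallocated.

    gpu_total_gib: list of total memory per GPU, in GiB, indexed by device id.
    """
    max_memory = {
        device_id: f"{int(total * (1 - headroom))}GiB"
        for device_id, total in enumerate(gpu_total_gib)
    }
    max_memory["cpu"] = f"{int(cpu_gib)}GiB"
    return max_memory


# One 24 GiB GPU plus 30 GiB of CPU RAM for offloaded layers
print(build_max_memory([24], cpu_gib=30))
```

The resulting dict is meant to be passed as `max_memory=` to `from_pretrained`, as in the call above.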
Issue: Lower accuracy than expected
Try 8-bit instead of 4-bit:
config = BitsAndBytesConfig(load_in_8bit=True)
# 8-bit typically has <0.5% accuracy loss vs 1-2% for 4-bit
Or, if you stay at 4-bit, use NF4 rather than FP4:
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # Usually more accurate than fp4
    bnb_4bit_use_double_quant=True,  # Saves extra memory with minimal accuracy impact
)
Issue: OOM even with 4-bit
Enable CPU and disk offload:
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=config,
    device_map="auto",
    offload_folder="offload",  # Disk offload
    offload_state_dict=True,
)
QLoRA training guide: See references/qlora-training.md for complete fine-tuning workflows, hyperparameter tuning, and multi-GPU training.
Quantization formats: See references/quantization-formats.md for INT8, NF4, FP4 comparison, double quantization, and custom quantization configs.
Memory optimization: See references/memory-optimization.md for CPU offloading strategies, gradient checkpointing, and memory profiling.
Supported platforms: NVIDIA GPUs (primary), AMD ROCm, Intel GPUs (experimental)
Repository: davila7/claude-code-templates
Weekly installs: 166
GitHub stars: 23.4K
First seen: Jan 21, 2026