quantizing-models-bitsandbytes by orchestra-research/ai-research-skills
npx skills add https://github.com/orchestra-research/ai-research-skills --skill quantizing-models-bitsandbytes
bitsandbytes reduces LLM memory by 50% (8-bit) or 75% (4-bit) with <1% accuracy loss.
Installation:
pip install bitsandbytes transformers accelerate
8-bit quantization (50% memory reduction):
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto"
)
# Memory: 14GB → 7GB
4-bit quantization (75% memory reduction):
import torch

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto"
)
# Memory: 14GB → 3.5GB
Copy this checklist:
Quantization Loading:
- [ ] Step 1: Calculate memory requirements
- [ ] Step 2: Choose quantization level (4-bit or 8-bit)
- [ ] Step 3: Configure quantization
- [ ] Step 4: Load and verify model
Step 1: Calculate memory requirements
Estimate model memory:
FP16 memory (GB) = Parameters × 2 bytes / 1e9
INT8 memory (GB) = Parameters × 1 byte / 1e9
INT4 memory (GB) = Parameters × 0.5 bytes / 1e9
Example (Llama 2 7B):
FP16: 7B × 2 / 1e9 = 14 GB
INT8: 7B × 1 / 1e9 = 7 GB
INT4: 7B × 0.5 / 1e9 = 3.5 GB
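The formulas above translate directly into a few lines of Python (the helper name is ours; this counts weights only — activations and KV cache add overhead on top):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight-only memory estimate: parameter count × storage width, in GB."""
    return params_billion * 1e9 * bytes_per_param / 1e9

# Llama 2 7B at each precision
for name, width in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"{name}: {weight_memory_gb(7, width):g} GB")
# FP16: 14 GB
# INT8: 7 GB
# INT4: 3.5 GB
```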
Step 2: Choose quantization level
| GPU VRAM | Model Size | Recommended |
|---|---|---|
| 8 GB | 3B | 4-bit |
| 12 GB | 7B | 4-bit |
| 16 GB | 7B | 8-bit or 4-bit |
| 24 GB | 13B | 8-bit or 4-bit |
| 40+ GB | 70B | 4-bit (8-bit needs ~70 GB) |
Step 3: Configure quantization
For 8-bit (better accuracy):
from transformers import BitsAndBytesConfig
import torch
config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # Outlier threshold
    llm_int8_has_fp16_weight=False
)
For 4-bit (maximum memory savings):
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # Compute in FP16
    bnb_4bit_quant_type="nf4",             # NormalFloat4 (recommended)
    bnb_4bit_use_double_quant=True         # Nested quantization
)
Step 4: Load and verify model
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=config,
    device_map="auto",  # Automatic device placement
    torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
# Test inference
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
# Check memory
import torch
print(f"Memory allocated: {torch.cuda.memory_allocated()/1e9:.2f}GB")
QLoRA enables fine-tuning large models on consumer GPUs.
Copy this checklist:
QLoRA Fine-tuning:
- [ ] Step 1: Install dependencies
- [ ] Step 2: Configure 4-bit base model
- [ ] Step 3: Add LoRA adapters
- [ ] Step 4: Train with standard Trainer
Step 1: Install dependencies
pip install bitsandbytes transformers peft accelerate datasets
Step 2: Configure 4-bit base model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
Step 3: Add LoRA adapters
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# Prepare model for training
model = prepare_model_for_kbit_training(model)
# Configure LoRA
lora_config = LoraConfig(
    r=16,           # LoRA rank
    lora_alpha=32,  # LoRA alpha (scaling)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
# Add LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 4.2M || all params: 6.7B || trainable%: 0.06%
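The printed percentage is simply trainable parameters over total; a quick sanity check of the numbers above:

```python
# Values from the print_trainable_parameters() output above
trainable, total = 4.2e6, 6.7e9
pct = 100 * trainable / total
print(f"trainable%: {pct:.2f}%")  # trainable%: 0.06%
```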
Step 4: Train with standard Trainer
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="./qlora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch"
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer
)
trainer.train()
# Save LoRA adapters (only ~20MB)
model.save_pretrained("./qlora-adapters")
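For later inference, the saved adapters can be re-attached to a freshly loaded 4-bit base model. A sketch, assuming the model name and adapter path from the steps above:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Reload the base model with the same 4-bit config used for training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
# Attach the ~20MB adapters saved above
model = PeftModel.from_pretrained(base, "./qlora-adapters")
model.eval()
```

PeftModel keeps the adapters separate from the quantized base weights, which is the usual setup for quantized inference.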
Use 8-bit Adam/AdamW to reduce optimizer memory by 75%.
8-bit Optimizer Setup:
- [ ] Step 1: Replace standard optimizer
- [ ] Step 2: Configure training
- [ ] Step 3: Monitor memory savings
Step 1: Replace standard optimizer
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Instead of torch.optim.AdamW
model = AutoModelForCausalLM.from_pretrained("model-name")
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=8,
    optim="paged_adamw_8bit",  # 8-bit paged optimizer
    learning_rate=5e-5
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)
trainer.train()
Manual optimizer usage:
import bitsandbytes as bnb

optimizer = bnb.optim.AdamW8bit(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    eps=1e-8
)
# Training loop
for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
Step 2: Configure training
Compare memory:
Standard AdamW optimizer memory = model_params × 8 bytes (two FP32 states: momentum and variance)
8-bit AdamW memory = model_params × 2 bytes (two INT8 states)
Savings = 75% of optimizer memory
Example (Llama 2 7B):
Standard: 7B × 8 = 56 GB
8-bit: 7B × 2 = 14 GB
Savings: 42 GB
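The arithmetic above, spelled out (AdamW keeps two states per parameter: two FP32 values take 8 bytes, two INT8 values take 2):

```python
params = 7e9  # Llama 2 7B
standard_gb = params * 8 / 1e9  # two FP32 states (momentum, variance)
bit8_gb = params * 2 / 1e9      # two INT8 states
print(standard_gb, bit8_gb, standard_gb - bit8_gb)  # 56.0 14.0 42.0
```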
Step 3: Monitor memory savings
import torch
before = torch.cuda.memory_allocated()
# Training step
optimizer.step()
after = torch.cuda.memory_allocated()
print(f"Memory used: {(after-before)/1e9:.2f}GB")
Use bitsandbytes when: you need to fit a model into limited GPU VRAM, you want quantization applied at load time with no calibration step, or you are fine-tuning with QLoRA.
Use alternatives instead (e.g. GPTQ, AWQ, or GGUF/llama.cpp) when: you need maximum quantized-inference throughput, pre-quantized checkpoints, or CPU-only deployment.
Issue: CUDA error during loading
Install matching CUDA version:
# Check CUDA version
nvcc --version
# Install matching bitsandbytes
pip install bitsandbytes --no-cache-dir
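Recent bitsandbytes releases also ship a built-in diagnostic that reports the CUDA setup it detected, which narrows down most installation problems:

```shell
python -m bitsandbytes
```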
Issue: Model loads slowly
Use CPU offload for large models:
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=config,
    device_map="auto",
    max_memory={0: "20GB", "cpu": "30GB"}  # Cap GPU 0; offload the rest to CPU
)
Issue: Lower accuracy than expected
Try 8-bit instead of 4-bit:
config = BitsAndBytesConfig(load_in_8bit=True)
# 8-bit has <0.5% accuracy loss vs 1-2% for 4-bit
Or use NF4 with double quantization:
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",      # Better than fp4
    bnb_4bit_use_double_quant=True  # Extra accuracy
)
Issue: OOM even with 4-bit
Enable CPU offload:
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=config,
    device_map="auto",
    offload_folder="offload",  # Disk offload
    offload_state_dict=True
)
QLoRA training guide: See references/qlora-training.md for complete fine-tuning workflows, hyperparameter tuning, and multi-GPU training.
Quantization formats: See references/quantization-formats.md for INT8, NF4, FP4 comparison, double quantization, and custom quantization configs.
Memory optimization: See references/memory-optimization.md for CPU offloading strategies, gradient checkpointing, and memory profiling.
Supported platforms: NVIDIA GPUs (primary), AMD ROCm, Intel GPUs (experimental)
Weekly Installs: 69
Repository: github.com/orchestra-research/ai-research-skills
GitHub Stars: 5.6K
First Seen: Feb 7, 2026
Security Audits: Gen Agent Trust Hub: Pass; Socket: Pass; Snyk: Warn
Installed on: opencode (60), codex (59), cursor (59), gemini-cli (58), github-copilot (57), claude-code (56)