bitsandbytes LLM Quantization Guide: 4-bit/8-bit Model Memory Optimization for Lower GPU Requirements

quantizing-models-bitsandbytes by davila7/claude-code-templates

166 weekly installs

23,400 GitHub Stars

Install command

npx skills add https://github.com/davila7/claude-code-templates --skill quantizing-models-bitsandbytes

Tags: AI/Machine Learning, PyTorch, Performance Optimization

bitsandbytes - LLM Quantization

Quick start

bitsandbytes reduces the memory needed for LLM weights by about 50% (8-bit) or 75% (4-bit) relative to FP16, typically with less than 1% accuracy loss in 8-bit and 1-2% in 4-bit.

Installation:

pip install bitsandbytes transformers accelerate

8-bit quantization (50% memory reduction):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto"
)

# Memory: 14GB → 7GB

4-bit quantization (75% memory reduction):

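import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16  # dtype used for matmuls after dequantization
)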
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto"
)

# Memory: 14GB → 3.5GB

Common workflows

Workflow 1: Load large model in limited GPU memory

Copy this checklist:

Quantization Loading:
- [ ] Step 1: Calculate memory requirements
- [ ] Step 2: Choose quantization level (4-bit or 8-bit)
- [ ] Step 3: Configure quantization
- [ ] Step 4: Load and verify model

Step 1: Calculate memory requirements

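Estimate the memory needed for the model weights (activations, the KV cache, and quantization constants add overhead on top):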

FP16 memory (GB) = Parameters × 2 bytes / 1e9
INT8 memory (GB) = Parameters × 1 byte / 1e9
INT4 memory (GB) = Parameters × 0.5 bytes / 1e9

Example (Llama 2 7B):
FP16: 7B × 2 / 1e9 = 14 GB
INT8: 7B × 1 / 1e9 = 7 GB
INT4: 7B × 0.5 / 1e9 = 3.5 GB
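
The formulas above can be wrapped in a small helper. This is a minimal sketch: `weight_memory_gb` is a hypothetical name, not part of bitsandbytes, and it counts weights only.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


llama2_7b = 7e9  # nominal parameter count used in the example above

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_memory_gb(llama2_7b, bits):.1f} GB")
# FP16: 14.0 GB
# INT8: 7.0 GB
# INT4: 3.5 GB
```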

Step 2: Choose quantization level

GPU VRAM	Model Size	Recommended
8 GB	3B	4-bit
12 GB	7B	4-bit
16 GB	7B	8-bit or 4-bit
24 GB	13B	8-bit or 4-bit
40+ GB	70B	4-bit (about 35 GB of weights; 8-bit needs about 70 GB)
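
As a rough cross-check of the table, the choice can be sketched as a heuristic. `pick_quantization` and its 20% `headroom` default are assumptions for illustration, not library behavior, and the result may be less conservative than the table.

```python
def pick_quantization(num_params: float, vram_gb: float, headroom: float = 0.2):
    """Return the highest-precision option whose weights fit in VRAM.

    `headroom` is an assumed fraction of VRAM reserved for activations,
    the KV cache, and CUDA overhead; tune it for your workload.
    """
    usable_gb = vram_gb * (1 - headroom)
    for label, bytes_per_param in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
        if num_params * bytes_per_param / 1e9 <= usable_gb:
            return label
    return None  # does not fit even in 4-bit: offload or pick a smaller model


print(pick_quantization(7e9, 8))    # 4-bit
print(pick_quantization(7e9, 16))   # 8-bit
print(pick_quantization(13e9, 24))  # 8-bit
```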

Step 3: Configure quantization

For 8-bit (better accuracy):

from transformers import BitsAndBytesConfig
import torch

config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # Outlier threshold
    llm_int8_has_fp16_weight=False
)
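
To see why an outlier threshold helps, here is a simplified pure-Python sketch of absmax int8 quantization with and without setting large values aside. It illustrates the idea only: the real LLM.int8() kernels work on matrices and feature dimensions, and the sample values are made up.

```python
def absmax_quantize(values):
    """Quantize floats to the int8 range [-127, 127] with one absmax scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


THRESHOLD = 6.0  # same value as llm_int8_threshold above
activations = [0.5, -1.25, 2.0, 40.0, -0.75]  # made-up values with one outlier

# Naive: quantize everything together, so the outlier dictates the scale.
q_all, scale_all = absmax_quantize(activations)
naive_err = max_abs_error(activations, dequantize(q_all, scale_all))

# Mixed: keep outliers in full precision and quantize only the rest.
outliers = {i: v for i, v in enumerate(activations) if abs(v) >= THRESHOLD}
regular = [0.0 if i in outliers else v for i, v in enumerate(activations)]
q_reg, scale_reg = absmax_quantize(regular)
mixed = dequantize(q_reg, scale_reg)
for i, v in outliers.items():
    mixed[i] = v
mixed_err = max_abs_error(activations, mixed)

print(f"max error, single scale:   {naive_err:.4f}")
print(f"max error, outliers split: {mixed_err:.4f}")
```

With the outlier removed, the scale shrinks from 40/127 to 2/127, so the remaining values are rounded on a much finer grid.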

For 4-bit (maximum memory savings):

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # Compute in FP16
    bnb_4bit_quant_type="nf4",  # NormalFloat4 (recommended)
    bnb_4bit_use_double_quant=True  # Nested quantization
)
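
What nested (double) quantization buys can be worked out with a little arithmetic. The block sizes below are the values described in the QLoRA paper (64 weights per block, 256 constants per second-level block); treating them as the bitsandbytes defaults is an assumption.

```python
BLOCK = 64          # weights sharing one quantization constant
SECOND_BLOCK = 256  # first-level constants sharing one second-level constant


def overhead_bits_per_param(double_quant: bool) -> float:
    """Extra bits per weight spent on quantization constants."""
    if not double_quant:
        return 32 / BLOCK  # one FP32 constant per block
    # 8-bit first-level constants plus FP32 second-level constants
    return 8 / BLOCK + 32 / (BLOCK * SECOND_BLOCK)


plain = overhead_bits_per_param(False)
nested = overhead_bits_per_param(True)
saved_gb_7b = (plain - nested) * 7e9 / 8 / 1e9

print(f"without double quantization: {plain:.3f} bits/param")
print(f"with double quantization:    {nested:.3f} bits/param")
print(f"saved on a 7B model:         {saved_gb_7b:.2f} GB")
```

The saving is about 0.37 bits per parameter, which is a memory benefit rather than an accuracy one.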

Step 4: Load and verify model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# `config` is the BitsAndBytesConfig built in Step 3
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=config,
    device_map="auto",  # Automatic device placement
    torch_dtype=torch.float16  # dtype for the modules that are not quantized
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

# Test inference
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))

# Check memory
print(f"Memory allocated: {torch.cuda.memory_allocated()/1e9:.2f}GB")

Workflow 2: Fine-tune with QLoRA (4-bit training)

QLoRA enables fine-tuning large models on consumer GPUs.

Copy this checklist:

QLoRA Fine-tuning:
- [ ] Step 1: Install dependencies
- [ ] Step 2: Configure 4-bit base model
- [ ] Step 3: Add LoRA adapters
- [ ] Step 4: Train with standard Trainer

Step 1: Install dependencies

pip install bitsandbytes transformers peft accelerate datasets

Step 2: Configure 4-bit base model

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

Step 3: Add LoRA adapters

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare model for training
model = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    r=16,  # LoRA rank
    lora_alpha=32,  # LoRA alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Add LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts. With r=16 on these four projections,
# expect about 16.8M trainable parameters out of ~6.7B (roughly 0.25%).
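
The trainable-parameter count follows directly from the LoRA shapes: each adapted linear layer of shape (d_in, d_out) gains two matrices with r × d_in and d_out × r entries. A minimal sketch, assuming the Llama 2 7B dimensions (hidden size 4096, 32 layers); `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(r, layers, module_shapes):
    """Count LoRA parameters: r * (d_in + d_out) per adapted linear layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in module_shapes)
    return layers * per_layer


hidden = 4096  # Llama 2 7B hidden size; all four attention projections are 4096 x 4096

# r=16 on q_proj, k_proj, v_proj, o_proj (the configuration above)
full = lora_trainable_params(16, 32, [(hidden, hidden)] * 4)
# r=8 on q_proj and v_proj only, a common smaller setup
small = lora_trainable_params(8, 32, [(hidden, hidden)] * 2)

print(f"{full:,}")   # 16,777,216
print(f"{small:,}")  # 4,194,304
```

The number reported by `print_trainable_parameters()` therefore depends on the rank and on which modules are targeted.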

Step 4: Train with standard Trainer

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./qlora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch"
)

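# `train_dataset` (already tokenized) and `tokenizer` are assumed to be defined earlier.
# Recent transformers releases rename `tokenizer=` to `processing_class=`.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer
)

trainer.train()

# Save only the LoRA adapters (tens of MB, versus about 14 GB for the FP16 base model)
model.save_pretrained("./qlora-adapters")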
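
The batch settings above combine into an effective batch size of 16. The sketch below estimates the resulting number of optimizer steps; the dataset size of 10,000 examples is a made-up figure, and the Trainer's exact rounding at epoch boundaries may differ.

```python
import math


def training_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# Values from the TrainingArguments above; 10_000 examples is hypothetical.
effective, total_steps = training_steps(10_000, per_device_batch=4, grad_accum=4, epochs=3)
print(effective)    # 16
print(total_steps)  # 1875
```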

Workflow 3: 8-bit optimizer for memory-efficient training

Use 8-bit Adam/AdamW to reduce optimizer memory by 75%.

8-bit Optimizer Setup:
- [ ] Step 1: Replace standard optimizer
- [ ] Step 2: Configure training
- [ ] Step 3: Monitor memory savings

Step 1: Replace standard optimizer

from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("model-name")

# Select the 8-bit optimizer via `optim` instead of a standard torch.optim.AdamW
# (bitsandbytes must be installed)
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=8,
    optim="paged_adamw_8bit",  # 8-bit optimizer
    learning_rate=5e-5
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)

trainer.train()

Manual optimizer usage:

# `model` and `dataloader` are assumed to be defined elsewhere
import bitsandbytes as bnb

optimizer = bnb.optim.AdamW8bit(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    eps=1e-8
)

# Training loop
for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Step 2: Configure training

Compare memory:

Standard AdamW optimizer memory = model_params × 8 bytes (two FP32 states × 4 bytes each)
8-bit AdamW memory = model_params × 2 bytes (two 8-bit states × 1 byte each)
Savings = 75% of optimizer memory

Example (Llama 2 7B):
Standard: 7B × 8 = 56 GB
8-bit: 7B × 2 = 14 GB
Savings: 42 GB
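
The same comparison as a short calculation. `adam_state_gb` is a hypothetical helper; it counts only the two Adam moment buffers, not weights or gradients.

```python
def adam_state_gb(trainable_params: float, bytes_per_state: int) -> float:
    """Adam keeps two states (first and second moment) per trainable parameter."""
    return trainable_params * 2 * bytes_per_state / 1e9


standard = adam_state_gb(7e9, 4)   # FP32 states
eight_bit = adam_state_gb(7e9, 1)  # 8-bit states
print(f"standard: {standard:.0f} GB, 8-bit: {eight_bit:.0f} GB, "
      f"saved: {standard - eight_bit:.0f} GB")
# standard: 56 GB, 8-bit: 14 GB, saved: 42 GB

# Only trainable parameters carry optimizer state. With about 16.8M LoRA
# parameters the FP32 states are already small:
print(f"{adam_state_gb(16_777_216, 4):.2f} GB")  # 0.13 GB
```

The large savings apply to full fine-tuning; with LoRA or QLoRA the optimizer state is small either way.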

Step 3: Monitor memory savings

import torch

before = torch.cuda.memory_allocated()

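# Training step. Optimizer states are allocated lazily on the first step() call,
# so measure around the first step to see their footprint.
optimizer.step()

after = torch.cuda.memory_allocated()
print(f"Memory used: {(after-before)/1e9:.2f}GB")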

When to use vs alternatives

Use bitsandbytes when:

GPU memory is limited (you need to fit a larger model)
Training with QLoRA (fine-tune models up to 70B on a single high-memory GPU)
Inference only (50-75% memory reduction)
Using HuggingFace Transformers
A 0-2% accuracy degradation is acceptable

Use alternatives instead:

GPTQ/AWQ: Production serving (faster inference than bitsandbytes)
GGUF: CPU inference (llama.cpp)
FP8: H100 GPUs (hardware FP8 is faster)
Full precision: Accuracy is critical and memory is not constrained

Common issues

Issue: CUDA error during loading

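Make sure bitsandbytes matches the CUDA version your PyTorch build uses:

# Check the CUDA toolkit version
nvcc --version

# Check the CUDA version PyTorch was built with
python -c "import torch; print(torch.version.cuda)"

# Reinstall bitsandbytes, then run its built-in diagnostics
pip install bitsandbytes --no-cache-dir
python -m bitsandbytes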

Issue: Model loading slow

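Use CPU offload for large models, capping how much memory each device may take:

# Note: transformers generally refuses to place modules of a quantized model on the CPU
# unless `config` sets llm_int8_enable_fp32_cpu_offload=True; offloaded modules stay unquantized.
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=config,
    device_map="auto",
    max_memory={0: "20GB", "cpu": "30GB"}  # Layers beyond 20 GB of GPU memory go to CPU RAM
)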
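
Rather than hard-coding the limits, the `max_memory` dict can be derived from the hardware. A minimal sketch: `build_max_memory` and the 2 GiB reserve are assumptions for illustration, and the sizes use the "GiB" string form that Accelerate accepts.

```python
def build_max_memory(gpu_total_gib, cpu_gib, reserve_gib=2):
    """Build a max_memory dict, leaving `reserve_gib` free on each GPU."""
    limits = {i: f"{total - reserve_gib}GiB" for i, total in enumerate(gpu_total_gib)}
    limits["cpu"] = f"{cpu_gib}GiB"
    return limits


print(build_max_memory([24], cpu_gib=30))
# {0: '22GiB', 'cpu': '30GiB'}
print(build_max_memory([24, 24], cpu_gib=64))
# {0: '22GiB', 1: '22GiB', 'cpu': '64GiB'}
```

The result can be passed as `max_memory=` to `from_pretrained`; the reserve leaves room for activations and the CUDA context.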

Issue: Lower accuracy than expected

Try 8-bit instead of 4-bit:

config = BitsAndBytesConfig(load_in_8bit=True)
# 8-bit has <0.5% accuracy loss vs 1-2% for 4-bit

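Or stay in 4-bit and use NF4 (double quantization is optional and only saves memory):

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # Usually more accurate than fp4 for normally distributed weights
    bnb_4bit_use_double_quant=True  # Saves about 0.4 bits per parameter; does not add accuracy
)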

Issue: OOM even with 4-bit

Enable CPU and disk offload:

model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=config,
    device_map="auto",
    offload_folder="offload",  # Disk offload
    offload_state_dict=True
)

Advanced topics

QLoRA training guide: See references/qlora-training.md for complete fine-tuning workflows, hyperparameter tuning, and multi-GPU training.

Quantization formats: See references/quantization-formats.md for INT8, NF4, FP4 comparison, double quantization, and custom quantization configs.

Memory optimization: See references/memory-optimization.md for CPU offloading strategies, gradient checkpointing, and memory profiling.

Hardware requirements

GPU: NVIDIA with compute capability 7.0+ (Volta, Turing, Ampere, Ada, Hopper)
VRAM: Depends on model and quantization (approximate)
- 4-bit Llama 2 7B: 4GB
- 4-bit Llama 2 13B: 8GB
- 4-bit Llama 2 70B: 35GB or more (the weights alone are about 35GB)
CUDA: 11.1+ (12.0+ recommended)
PyTorch: 2.0+

Supported platforms: NVIDIA GPUs (primary), AMD ROCm, Intel GPUs (experimental)

Resources

GitHub: https://github.com/bitsandbytes-foundation/bitsandbytes
HuggingFace docs: https://huggingface.co/docs/transformers/quantization/bitsandbytes
QLoRA paper: "QLoRA: Efficient Finetuning of Quantized LLMs" (2023)
LLM.int8() paper: "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (2022)

First seen: Jan 21, 2026
