npx skills add https://github.com/doanchienthangdev/omgkit --skill finetuning
Adapting Foundation Models for specific tasks.
def training_memory_gb(num_params_billion, precision="fp16"):
    """Rough full-finetuning memory estimate: weights + gradients + fp32 master copy + AdamW states (activations not included)."""
    bytes_per = {"fp32": 4, "fp16": 2, "int8": 1}
    n = num_params_billion * 1e9
    weights = n * bytes_per[precision]
    gradients = n * bytes_per[precision]
    master = n * 4 if precision != "fp32" else 0   # fp32 master copy kept for mixed precision
    optimizer = n * 4 * 2                          # AdamW first and second moments (fp32)
    return (weights + gradients + master + optimizer) / 1e9

# 7B model full finetuning: ~112 GB!
# With LoRA: ~16 GB
# With QLoRA: ~6 GB
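A quick check of the full-finetuning figure (a minimal sketch; the 7B size is illustrative and activation memory is not counted):

print(training_memory_gb(7))   # 112.0 -> roughly 112 GB for a 7B model in fp16 mixed precision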
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,                      # Rank (lower = fewer trainable params)
    lora_alpha=32,            # Scaling factor
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, config)

# ~0.06% of 7B trainable!
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
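To see the ratio behind that ~0.06% figure, compare against the total parameter count (a small sketch; PEFT's model.print_trainable_parameters() prints a similar summary):

total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({trainable / total:.2%})")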
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
model = get_peft_model(model, config)   # attach the LoraConfig defined above
# 7B on a 16GB GPU!
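A rough back-of-envelope for why this fits: the 4-bit base weights take about half a byte per parameter, and only the small LoRA adapters carry gradients and optimizer states. A minimal sketch (the ~4M adapter-parameter count assumes r=8 on q_proj/v_proj of a 7B model; activations and framework overhead are not included):

def qlora_memory_gb(num_params_billion, adapter_params_million=4.2):
    base = num_params_billion * 1e9 * 0.5                     # 4-bit quantized base weights
    adapters = adapter_params_million * 1e6 * (2 + 2 + 8)     # bf16 adapter weights + grads + AdamW states
    return (base + adapters) / 1e9

print(qlora_memory_gb(7))   # ~3.55 GB of weights/states; activations push the total toward ~6 GB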
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_steps=100,
    fp16=True,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data,
    eval_dataset=eval_data,
)
trainer.train()
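With the settings above, gradient accumulation raises the effective batch size without the memory cost of a larger per-device batch (a quick check, single GPU assumed):

effective_batch = 4 * 4   # per_device_train_batch_size * gradient_accumulation_steps = 16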
# Merge LoRA back
merged = model.merge_and_unload()
merged.save_pretrained("./finetuned")
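Merging is optional: the adapter can also be saved on its own (a few megabytes) and re-attached to the base model at load time. A minimal sketch, assuming the paths are illustrative:

from peft import PeftModel

model.save_pretrained("./lora-adapter")                          # saves only the adapter weights
reloaded = PeftModel.from_pretrained(base_model, "./lora-adapter")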
def task_vector_merge(base, finetuned_models, scale=0.3):
    base_sd = base.state_dict()
    merged = {k: v.clone() for k, v in base_sd.items()}
    for ft in finetuned_models:
        ft_sd = ft.state_dict()
        for key in merged:
            task_vector = ft_sd[key] - base_sd[key]   # delta vs. the original base weights
            merged[key] += scale * task_vector
    return merged
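Hypothetical usage, blending two task-specialized checkpoints into the base model (math_model and code_model are illustrative names):

merged_sd = task_vector_merge(base_model, [math_model, code_model], scale=0.3)
base_model.load_state_dict(merged_sd)
base_model.save_pretrained("./merged")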
Weekly Installs: 1
Repository: https://github.com/doanchienthangdev/omgkit
GitHub Stars: 3
First Seen: 1 day ago
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: zencoder (1), amp (1), cline (1), openclaw (1), opencode (1), cursor (1)