Fine-Tuning Assistant by jmsktm/claude-settings
npx skills add https://github.com/jmsktm/claude-settings --skill 'Fine-Tuning Assistant'
The Fine-Tuning Assistant skill guides you through the process of adapting pre-trained models to your specific use case. Fine-tuning can dramatically improve model performance on specialized tasks, teach models your preferred style, and add capabilities that prompting alone cannot achieve.
This skill covers when to fine-tune versus prompt engineer, preparing training data, selecting base models, configuring training parameters, evaluating results, and deploying fine-tuned models. It applies modern techniques including LoRA, QLoRA, and instruction tuning to make fine-tuning practical and cost-effective.
Whether you are fine-tuning GPT models via API, running local training with open-source models, or using platforms like Hugging Face, this skill ensures you approach fine-tuning strategically and effectively.
Collect training examples:
Format for training:
{"messages": [ {"role": "system", "content": "You are a helpful assistant..."}, {"role": "user", "content": "User input here"}, {"role": "assistant", "content": "Ideal response here"} ]}
Quality assurance:
Split train/validation/test sets
Validate dataset format
Select base model:
Configure training:
training_config = {
"model": "gpt-4o-mini-2024-07-18",
"training_file": "file-xxx",
"hyperparameters": {
"n_epochs": 3,
"batch_size": "auto",
"learning_rate_multiplier": "auto"
}
}
# LoRA fine-tuning (local)
lora_config = {
"r": 16, # Rank
"lora_alpha": 32,
"lora_dropout": 0.05,
"target_modules": ["q_proj", "v_proj"]
}
3. Monitor training:
- Watch loss curves
- Check for overfitting
- Validate on held-out set
4. Evaluate results:
- Compare to baseline model
- Test on diverse inputs
- Check for regressions
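The "Validate dataset format" step above can be sketched as a small checker for the chat-format JSONL shown earlier. All function and field names here are illustrative, not part of any SDK:

```python
import json

def validate_example(line: str) -> list[str]:
    """Return a list of problems found in one JSONL training line."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    messages = obj.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    allowed_roles = {"system", "user", "assistant"}
    for i, msg in enumerate(messages):
        if msg.get("role") not in allowed_roles:
            problems.append(f"message {i}: bad role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i}: empty content")
    if messages[-1].get("role") != "assistant":
        problems.append("last message should be the assistant response")
    return problems
```

Running every line of the training file through a checker like this before uploading catches the most common rejection causes: malformed JSON, unknown roles, empty completions, and examples that do not end with an assistant turn.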
| Action | Command/Trigger |
|---|---|
| Decide approach | "Should I fine-tune for [task]" |
| Prepare data | "Format data for fine-tuning" |
| Choose model | "Which model to fine-tune for [task]" |
| Configure training | "Fine-tuning parameters for [goal]" |
| Evaluate results | "Evaluate fine-tuned model" |
| Debug training | "Fine-tuning loss not decreasing" |
Start with Prompting: Fine-tuning is expensive; exhaust cheaper options first
Quality Over Quantity: 100 excellent examples beat 10,000 mediocre ones
Match Format to Use Case: Training examples should mirror real usage
Don't Over-Train: More epochs are not always better
Evaluate Properly: Training loss is not the goal; held-out task performance is
Version Everything: Fine-tuning is iterative; track datasets, configs, and model versions
Efficient fine-tuning for large models:
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=16, # Rank of update matrices
lora_alpha=32, # Scaling factor
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
# Apply LoRA to base model
model = get_peft_model(base_model, lora_config)
# Only ~0.1% of parameters are trainable
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
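The tiny trainable fraction follows from LoRA's arithmetic: each adapted d_out × d_in weight stays frozen, and the update is factored as B @ A with B of shape d_out × r and A of shape r × d_in. A back-of-the-envelope check (layer and model sizes here are illustrative, loosely modeled on a 7B transformer):

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    # LoRA freezes the d_out x d_in weight and trains B @ A instead,
    # where B is d_out x r and A is r x d_in.
    return d_out * r + r * d_in

# One 4096 x 4096 attention projection at rank 16:
full_layer = 4096 * 4096                       # 16,777,216 frozen parameters
per_layer = lora_param_count(4096, 4096, 16)   # 131,072 trainable parameters

# Adapting q_proj and v_proj across 32 transformer layers of a ~6.7B model:
total_lora = 32 * 2 * per_layer
print(f"per layer:   {per_layer / full_layer:.2%}")  # 0.78%
print(f"whole model: {total_lora / 6.7e9:.2%}")      # 0.13%
```

The per-layer fraction is under 1%, and because embeddings, MLPs, and unadapted projections stay frozen, the whole-model fraction lands near the ~0.1% mentioned in the comment above.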
Fine-tune large models on consumer hardware:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True
)
# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config
)
# Apply LoRA on top
model = get_peft_model(model, lora_config)
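Why 4-bit loading makes consumer hardware viable comes down to weight-storage size. A rough estimate, ignoring activations, KV cache, and quantization overhead:

```python
def weight_memory_gib(n_params: float, bits: int) -> float:
    # bytes = params * bits / 8; GiB = bytes / 2**30
    return n_params * bits / 8 / 2**30

n = 7e9  # a 7B-parameter model such as Llama-2-7b
print(f"fp16 : {weight_memory_gib(n, 16):.1f} GiB")  # ~13.0 GiB
print(f"4-bit: {weight_memory_gib(n, 4):.1f} GiB")   # ~3.3 GiB
```

Dropping from ~13 GiB to ~3.3 GiB of weight memory is what lets a 7B model plus small LoRA adapters fit on a single 8–12 GB consumer GPU.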
Convert raw data to instruction format:
def create_instruction_example(raw_data):
return {
"messages": [
{
"role": "system",
"content": "You are a customer service agent for TechCorp..."
},
{
"role": "user",
"content": f"Customer inquiry: {raw_data['inquiry']}"
},
{
"role": "assistant",
"content": raw_data['ideal_response']
}
]
}
# Apply to dataset
instruction_dataset = [create_instruction_example(d) for d in raw_dataset]
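OpenAI-style fine-tuning endpoints and most local trainers expect this dataset serialized as JSONL, one example per line. A minimal sketch (the filename is illustrative):

```python
import json

def write_jsonl(examples, path):
    # One JSON object per line, no trailing commas, UTF-8 throughout.
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# write_jsonl(instruction_dataset, "train.jsonl")
```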
Comprehensive assessment of fine-tuned models:
import numpy as np

def evaluate_fine_tuned_model(model, test_set, baseline_model=None):
results = {
"task_accuracy": [],
"format_compliance": [],
"style_match": [],
"regression_check": []
}
for example in test_set:
output = model.generate(example.input)
# Task-specific accuracy
results["task_accuracy"].append(
check_correctness(output, example.expected)
)
# Format compliance
results["format_compliance"].append(
matches_expected_format(output)
)
# Style matching (for style transfer tasks)
results["style_match"].append(
style_similarity(output, example.expected)
)
# Regression on general capabilities
if baseline_model:
results["regression_check"].append(
compare_general_capability(model, baseline_model, example)
)
return {k: np.mean(v) for k, v in results.items()}
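The helpers above (check_correctness, matches_expected_format, style_similarity) are task-specific stubs. For a model fine-tuned to emit structured JSON, matches_expected_format might be sketched as (the required key is illustrative):

```python
import json

def matches_expected_format(output: str, required_keys=("answer",)) -> bool:
    """True if the model output parses as a JSON object with the required keys."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(k in obj for k in required_keys)
```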
Order training data by difficulty:
def create_curriculum(dataset):
# Score examples by complexity
scored = [(score_complexity(ex), ex) for ex in dataset]
scored.sort(key=lambda x: x[0])
# Create epochs with increasing difficulty
n = len(scored)
curriculum = {
"epoch_1": [ex for _, ex in scored[:n//3]], # Easy
"epoch_2": [ex for _, ex in scored[:2*n//3]], # Easy + Medium
"epoch_3": [ex for _, ex in scored], # All
}
return curriculum
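score_complexity above is a placeholder. With a toy scorer (text length) the same sort-and-slice logic produces nested epochs of increasing difficulty:

```python
# Toy demonstration: six examples, complexity approximated by text length.
dataset = ["a", "bb", "ccc", "dddd", "eeeee", "ffffff"]
scored = sorted(dataset, key=len)  # easiest first, as in create_curriculum

n = len(scored)
epochs = {
    "epoch_1": scored[:n // 3],       # easiest third
    "epoch_2": scored[:2 * n // 3],   # easy + medium
    "epoch_3": scored,                # everything
}
print({k: len(v) for k, v in epochs.items()})  # {'epoch_1': 2, 'epoch_2': 4, 'epoch_3': 6}
```

Note the epochs are cumulative: hard examples appear only after the model has seen the easy ones, but easy examples are never dropped.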
GitHub Stars: 2