fine-tuning-with-trl by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill fine-tuning-with-trl
TRL provides post-training methods for aligning language models with human preferences.
Installation:
pip install trl transformers datasets peft accelerate
Supervised Fine-Tuning (instruction tuning):
from trl import SFTTrainer
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,  # Prompt-completion pairs
)
trainer.train()
DPO (align with preferences):
from trl import DPOTrainer, DPOConfig
config = DPOConfig(output_dir="model-dpo", beta=0.1)
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=preference_dataset,  # chosen/rejected pairs
    processing_class=tokenizer
)
trainer.train()
Complete pipeline from base model to human-aligned model.
Copy this checklist:
RLHF Training:
- [ ] Step 1: Supervised fine-tuning (SFT)
- [ ] Step 2: Train reward model
- [ ] Step 3: PPO reinforcement learning
- [ ] Step 4: Evaluate aligned model
Step 1: Supervised fine-tuning
Train base model on instruction-following data:
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# Load model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
# Load instruction dataset
dataset = load_dataset("trl-lib/Capybara", split="train")
# Configure training
training_args = SFTConfig(
    output_dir="Qwen2.5-0.5B-SFT",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=2e-5,
    logging_steps=10,
    save_strategy="epoch"
)
# Train
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer  # recent TRL versions use processing_class instead of the deprecated tokenizer argument
)
trainer.train()
trainer.save_model()
Step 2: Train reward model
Train model to predict human preferences:
from transformers import AutoModelForSequenceClassification
from trl import RewardTrainer, RewardConfig
# Load SFT model as base
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen2.5-0.5B-SFT",
    num_labels=1  # Single reward score
)
tokenizer = AutoTokenizer.from_pretrained("Qwen2.5-0.5B-SFT")
# Load preference data (chosen/rejected pairs)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
# Configure training
training_args = RewardConfig(
    output_dir="Qwen2.5-0.5B-Reward",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=1e-5
)
# Train reward model
trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset
)
trainer.train()
trainer.save_model()
Step 3: PPO reinforcement learning
Optimize policy using reward model:
python -m trl.scripts.ppo \
--model_name_or_path Qwen2.5-0.5B-SFT \
--reward_model_path Qwen2.5-0.5B-Reward \
--dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
--output_dir Qwen2.5-0.5B-PPO \
--learning_rate 3e-6 \
--per_device_train_batch_size 64 \
--total_episodes 10000
Step 4: Evaluate
from transformers import pipeline
# Load aligned model
generator = pipeline("text-generation", model="Qwen2.5-0.5B-PPO")
# Test
prompt = "Explain quantum computing to a 10-year-old"
output = generator(prompt, max_length=200)[0]["generated_text"]
print(output)
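Beyond eyeballing generations, a rough quantitative check is to score outputs with the reward model trained in Step 2. A minimal sketch, assuming the Step 2 checkpoint lives at Qwen2.5-0.5B-Reward and using the pipeline's post-processed score only for relative comparison:
from transformers import pipeline
# Reward model from Step 2 (single-logit sequence-classification head)
reward_scorer = pipeline("text-classification", model="Qwen2.5-0.5B-Reward")
# Compare the PPO-aligned model against the SFT-only model on the same prompt
sft_generator = pipeline("text-generation", model="Qwen2.5-0.5B-SFT")
candidates = {
    "sft": sft_generator(prompt, max_length=200)[0]["generated_text"],
    "ppo": output,  # generated by the aligned model above
}
for name, text in candidates.items():
    score = reward_scorer(text)[0]["score"]
    print(f"{name}: reward score = {score:.3f}")
A higher score for the PPO output than for the SFT output is a weak but quick signal that alignment moved in the intended direction.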
Align the model with preferences without a reward model.
Copy this checklist:
DPO Training:
- [ ] Step 1: Prepare preference dataset
- [ ] Step 2: Configure DPO
- [ ] Step 3: Train with DPOTrainer
- [ ] Step 4: Evaluate alignment
Step 1: Prepare preference dataset
Dataset format:
{
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "I don't know."
}
Load dataset:
from datasets import load_dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
# Or load your own
# dataset = load_dataset("json", data_files="preferences.json")
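If your preference pairs live in Python objects rather than files, the same structure can be built in memory; a small sketch using datasets.Dataset.from_list (the pairs shown are illustrative):
from datasets import Dataset
pairs = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris.",
        "rejected": "I don't know.",
    },
    # ... add more pairs here
]
dataset = Dataset.from_list(pairs)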
Step 2: Configure DPO
from trl import DPOConfig
config = DPOConfig(
    output_dir="Qwen2.5-0.5B-DPO",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=5e-7,
    beta=0.1,  # KL penalty strength
    max_prompt_length=512,
    max_length=1024,
    logging_steps=10
)
Step 3: Train with DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer
)
trainer.train()
trainer.save_model()
CLI alternative:
trl dpo \
--model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--dataset_name argilla/Capybara-Preferences \
--output_dir Qwen2.5-0.5B-DPO \
--per_device_train_batch_size 4 \
--learning_rate 5e-7 \
--beta 0.1
Train with reinforcement learning using minimal memory.
Copy this checklist:
GRPO Training:
- [ ] Step 1: Define reward function
- [ ] Step 2: Configure GRPO
- [ ] Step 3: Train with GRPOTrainer
Step 1: Define reward function
def reward_function(completions, **kwargs):
    """
    Compute rewards for completions.
    Args:
        completions: List of generated texts
    Returns:
        List of reward scores (floats)
    """
    rewards = []
    for completion in completions:
        # Example: reward based on length and unique words
        score = len(completion.split())  # Favor longer responses
        score += len(set(completion.lower().split()))  # Reward unique words
        rewards.append(float(score))
    return rewards
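A quick sanity check of the heuristic (purely illustrative):
print(reward_function(["Short.", "A longer completion with several distinct and informative words."]))
# The second completion should receive the higher score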
Or use a reward model:
from transformers import pipeline
reward_model = pipeline("text-classification", model="reward-model-path")
def reward_from_model(completions, prompts, **kwargs):
    # Combine prompt + completion
    full_texts = [p + c for p, c in zip(prompts, completions)]
    # Get reward scores
    results = reward_model(full_texts)
    return [r["score"] for r in results]
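Note that a text-classification pipeline post-processes the logit (for a single-label head it typically applies a sigmoid). If you want the raw reward logit, which is what RewardTrainer optimizes, a hedged sketch that calls the model directly ("reward-model-path" is a placeholder):
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
rm = AutoModelForSequenceClassification.from_pretrained("reward-model-path", num_labels=1)
rm_tokenizer = AutoTokenizer.from_pretrained("reward-model-path")
def reward_from_logits(completions, prompts, **kwargs):
    full_texts = [p + c for p, c in zip(prompts, completions)]
    inputs = rm_tokenizer(full_texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = rm(**inputs).logits.squeeze(-1)  # one raw reward per text
    return logits.tolist()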
Step 2: Configure GRPO
from trl import GRPOConfig
config = GRPOConfig(
    output_dir="Qwen2-GRPO",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=1e-5,
    num_generations=4,  # Generate 4 completions per prompt
    max_completion_length=128  # GRPOConfig caps generation length via max_completion_length
)
Step 3: Train with GRPOTrainer
from datasets import load_dataset
from trl import GRPOTrainer
# Load prompt-only dataset
dataset = load_dataset("trl-lib/tldr", split="train")
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_function,  # Your reward function
    args=config,
    train_dataset=dataset
)
trainer.train()
CLI:
trl grpo \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/tldr \
--output_dir Qwen2-GRPO \
--num_generations 4
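reward_funcs is plural for a reason: recent TRL versions also accept a list of reward functions and combine their scores per completion, which is useful for mixing a content reward with a format check. A hedged sketch (format_reward is a hypothetical helper defined here):
def format_reward(completions, **kwargs):
    # Hypothetical check: reward completions that end with sentence punctuation
    return [1.0 if c.strip().endswith((".", "!", "?")) else 0.0 for c in completions]
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=[reward_function, format_reward],  # both applied to every completion
    args=config,
    train_dataset=dataset
)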
Use TRL when:
Method selection:
Use alternatives instead:
Issue: OOM during DPO training
Reduce batch size and sequence length:
config = DPOConfig(
    per_device_train_batch_size=1,  # Reduce from 4
    max_length=512,  # Reduce from 1024
    gradient_accumulation_steps=8  # Maintain effective batch size
)
Or use gradient checkpointing:
model.gradient_checkpointing_enable()
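Since peft is already in the install line, another option is to train a LoRA adapter instead of all weights: DPOTrainer accepts a peft_config, and with an adapter the reference policy is typically obtained by disabling the adapter rather than keeping a second full model in memory. A sketch with illustrative LoRA settings:
from peft import LoraConfig
peft_config = LoraConfig(
    r=16,  # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative; adjust for your architecture
    task_type="CAUSAL_LM"
)
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config  # only adapter weights are trained
)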
Issue: Poor alignment quality
Tune beta parameter:
# Higher beta = more conservative (stays closer to reference)
config = DPOConfig(beta=0.5) # Default 0.1
# Lower beta = more aggressive alignment
config = DPOConfig(beta=0.01)
Issue: Reward model not learning
Check loss type and learning rate:
config = RewardConfig(
    learning_rate=1e-5,  # Try different LR
    num_train_epochs=3  # Train longer
)
Ensure preference dataset has clear winners:
# Verify dataset
print(dataset[0])
# Should have clear chosen > rejected
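A quick way to drop degenerate pairs (identical or empty responses) before training, sketched with datasets.filter:
def is_clear_pair(example):
    chosen, rejected = example["chosen"], example["rejected"]
    return bool(chosen) and bool(rejected) and chosen != rejected
dataset = dataset.filter(is_clear_pair)
print(f"Kept {len(dataset)} unambiguous preference pairs")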
Issue: PPO training unstable
Adjust KL coefficient:
config = PPOConfig(
    kl_coef=0.1,  # Increase from 0.05
    cliprange=0.1  # Reduce from 0.2
)
SFT training guide: See references/sft-training.md for dataset formats, chat templates, packing strategies, and multi-GPU training.
DPO variants: See references/dpo-variants.md for IPO, cDPO, RPO, and other DPO loss functions with recommended hyperparameters.
Reward modeling: See references/reward-modeling.md for outcome vs process rewards, Bradley-Terry loss, and reward model evaluation.
Online RL methods: See references/online-rl.md for PPO, GRPO, RLOO, and OnlineDPO with detailed configurations.
Memory optimization: TRL trainers integrate with accelerate for additional memory optimizations.