grpo-rl-training by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill grpo-rl-training关于使用 Transformer Reinforcement Learning (TRL) 库实现组相对策略优化 (GRPO) 的专家级指南。本技能提供了经过实战检验的模式、关键见解以及用于使用自定义奖励函数微调语言模型的生产就绪工作流。
在以下情况下使用 GRPO 训练:
不要将 GRPO 用于:
关键机制:
与 PPO 的关键区别:
数学直觉:
For each prompt p:
1. Generate N completions: {c₁, c₂, ..., cₙ}
2. Compute rewards: {r₁, r₂, ..., rₙ}
3. Learn to increase probability of high-reward completions
relative to low-reward ones in the same group
广告位招租
在这里展示您的产品或服务
触达数万 AI 开发者,精准高效
黄金法则:
奖励函数类型:
| 类型 | 用例 | 示例权重 |
|---|---|---|
| 正确性 | 可验证任务(数学、代码) | 2.0 (最高) |
| 格式 | 严格的结构强制执行 | 0.5-1.0 |
| 长度 | 鼓励详细/简洁 | 0.1-0.5 |
| 风格 | 惩罚不需要的模式 | -0.5 到 0.5 |
关键要求:
示例结构:
from datasets import load_dataset, Dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
[Your step-by-step thinking]
</reasoning>
<answer>
[Final answer]
</answer>
"""
def prepare_dataset(raw_data):
"""
Transform raw data into GRPO-compatible format.
Returns: Dataset with columns:
- 'prompt': List[Dict] with role/content (system + user messages)
- 'answer': str (ground truth, optional but recommended)
"""
return raw_data.map(lambda x: {
'prompt': [
{'role': 'system', 'content': SYSTEM_PROMPT},
{'role': 'user', 'content': x['question']}
],
'answer': extract_answer(x['raw_answer'])
})
专业提示:
模板结构:
def reward_function_name(
prompts, # List[List[Dict]]: Original prompts
completions, # List[List[Dict]]: Model generations
answer=None, # Optional: Ground truth from dataset
**kwargs # Additional dataset columns
) -> list[float]:
"""
Evaluate completions and return rewards.
Returns: List of floats (one per completion)
"""
# Extract completion text
responses = [comp[0]['content'] for comp in completions]
# Compute rewards
rewards = []
for response in responses:
score = compute_score(response)
rewards.append(score)
return rewards
示例 1:正确性奖励(数学/编码)
def correctness_reward(prompts, completions, answer, **kwargs):
"""Reward correct answers with high score."""
responses = [comp[0]['content'] for comp in completions]
extracted = [extract_final_answer(r) for r in responses]
return [2.0 if ans == gt else 0.0
for ans, gt in zip(extracted, answer)]
示例 2:格式奖励(结构化输出)
import re
def format_reward(completions, **kwargs):
"""Reward XML-like structured format."""
pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>'
responses = [comp[0]['content'] for comp in completions]
return [1.0 if re.search(pattern, r, re.DOTALL) else 0.0
for r in responses]
示例 3:增量格式奖励(部分奖励)
def incremental_format_reward(completions, **kwargs):
"""Award partial credit for format compliance."""
responses = [comp[0]['content'] for comp in completions]
rewards = []
for r in responses:
score = 0.0
if '<reasoning>' in r:
score += 0.25
if '</reasoning>' in r:
score += 0.25
if '<answer>' in r:
score += 0.25
if '</answer>' in r:
score += 0.25
# Penalize extra text after closing tag
if r.count('</answer>') == 1:
extra_text = r.split('</answer>')[-1].strip()
score -= len(extra_text) * 0.001
rewards.append(score)
return rewards
关键见解: 结合 3-5 个奖励函数以实现稳健的训练。信号的多样性比顺序更重要。
内存优化配置(小 GPU)
from trl import GRPOConfig
training_args = GRPOConfig(
output_dir="outputs/grpo-model",
# Learning rate
learning_rate=5e-6, # Lower = more stable
adam_beta1=0.9,
adam_beta2=0.99,
weight_decay=0.1,
warmup_ratio=0.1,
lr_scheduler_type='cosine',
# Batch settings
per_device_train_batch_size=1,
gradient_accumulation_steps=4, # Effective batch = 4
# GRPO-specific
num_generations=8, # Group size: 8-16 recommended
max_prompt_length=256,
max_completion_length=512,
# Training duration
num_train_epochs=1,
max_steps=None, # Or set fixed steps (e.g., 500)
# Optimization
bf16=True, # Faster on A100/H100
optim="adamw_8bit", # Memory-efficient optimizer
max_grad_norm=0.1,
# Logging
logging_steps=1,
save_steps=100,
report_to="wandb", # Or "none" for no logging
)
高性能配置(大 GPU)
training_args = GRPOConfig(
output_dir="outputs/grpo-model",
learning_rate=1e-5,
per_device_train_batch_size=4,
gradient_accumulation_steps=2,
num_generations=16, # Larger groups = better signal
max_prompt_length=512,
max_completion_length=1024,
num_train_epochs=1,
bf16=True,
use_vllm=True, # Fast generation with vLLM
logging_steps=10,
)
关键超参数:
| 参数 | 影响 | 调优建议 |
|---|---|---|
num_generations | 用于比较的组大小 | 从 8 开始,如果 GPU 允许则增加到 16 |
learning_rate | 收敛速度/稳定性 | 5e-6 (安全), 1e-5 (更快,风险更高) |
max_completion_length | 输出详细程度 | 匹配您的任务(推理用 512,简短答案用 256) |
gradient_accumulation_steps | 有效批次大小 | 如果 GPU 内存有限则增加 |
标准设置 (Transformers)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import GRPOTrainer
# Load model
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2", # 2-3x faster
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# Optional: LoRA for parameter-efficient training
peft_config = LoraConfig(
r=16, # Rank (higher = more capacity)
lora_alpha=32, # Scaling factor (typically 2*r)
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
],
task_type="CAUSAL_LM",
lora_dropout=0.05,
)
# Initialize trainer
trainer = GRPOTrainer(
model=model,
processing_class=tokenizer,
reward_funcs=[
incremental_format_reward,
format_reward,
correctness_reward,
],
args=training_args,
train_dataset=dataset,
peft_config=peft_config, # Remove for full fine-tuning
)
# Train
trainer.train()
# Save
trainer.save_model("final_model")
Unsloth 设置(2-3 倍更快)
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="google/gemma-3-1b-it",
max_seq_length=1024,
load_in_4bit=True,
fast_inference=True,
max_lora_rank=32,
)
model = FastLanguageModel.get_peft_model(
model,
r=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=32,
use_gradient_checkpointing="unsloth",
)
# Rest is identical to standard setup
trainer = GRPOTrainer(model=model, ...)
trainer.train()
需要关注的关键指标:
reward: 所有完成项的平均值reward_std: 组内多样性(应保持 > 0)kl: 与参考的 KL 散度(应适度增长)健康的训练模式:
Step Reward Reward_Std KL
100 0.5 0.3 0.02
200 0.8 0.25 0.05
300 1.2 0.2 0.08 ← Good progression
400 1.5 0.15 0.12
警告信号:
| 问题 | 症状 | 解决方案 |
|---|---|---|
| 模式坍缩 | 所有完成项相同 | 增加 num_generations,添加多样性惩罚 |
| 无学习 | 奖励持平 | 检查奖励函数逻辑,增加 LR |
| OOM 错误 | GPU 内存不足 | 减少 num_generations,启用梯度检查点 |
| 训练缓慢 | < 1 it/s | 启用 use_vllm=True,使用 Unsloth,减少序列长度 |
| 忽略格式 | 模型不遵循结构 | 增加格式奖励权重,添加增量奖励 |
对于复杂任务,分阶段训练:
# Stage 1: Format compliance (epochs=1)
trainer_stage1 = GRPOTrainer(
model=model,
reward_funcs=[incremental_format_reward, format_reward],
...
)
trainer_stage1.train()
# Stage 2: Correctness (epochs=1)
trainer_stage2 = GRPOTrainer(
model=model,
reward_funcs=[format_reward, correctness_reward],
...
)
trainer_stage2.train()
class AdaptiveReward:
def __init__(self, base_reward_func, initial_weight=1.0):
self.func = base_reward_func
self.weight = initial_weight
def __call__(self, *args, **kwargs):
rewards = self.func(*args, **kwargs)
return [r * self.weight for r in rewards]
def adjust_weight(self, success_rate):
"""Increase weight if model struggling, decrease if succeeding."""
if success_rate < 0.3:
self.weight *= 1.2
elif success_rate > 0.8:
self.weight *= 0.9
def load_custom_knowledge_base(csv_path):
"""Example: School communication platform docs."""
import pandas as pd
df = pd.read_csv(csv_path)
dataset = Dataset.from_pandas(df).map(lambda x: {
'prompt': [
{'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT},
{'role': 'user', 'content': x['question']}
],
'answer': x['expert_answer']
})
return dataset
# Merge LoRA adapters into base model
if hasattr(trainer.model, 'merge_and_unload'):
merged_model = trainer.model.merge_and_unload()
merged_model.save_pretrained("production_model")
tokenizer.save_pretrained("production_model")
from transformers import pipeline
generator = pipeline(
"text-generation",
model="production_model",
tokenizer=tokenizer
)
result = generator(
[
{'role': 'system', 'content': SYSTEM_PROMPT},
{'role': 'user', 'content': "What is 15 + 27?"}
],
max_new_tokens=256,
do_sample=True,
temperature=0.7,
top_p=0.9
)
print(result[0]['generated_text'])
训练前:
训练期间:
训练后:
# Debug reward function
def debug_reward(completions, **kwargs):
responses = [comp[0]['content'] for comp in completions]
for i, r in enumerate(responses[:2]): # Print first 2
print(f"Response {i}: {r[:200]}...")
return [1.0] * len(responses) # Dummy rewards
# Test without training
trainer = GRPOTrainer(..., reward_funcs=[debug_reward])
trainer.generate_completions(dataset[:1]) # Generate without updating
官方文档:
示例仓库:
推荐阅读:
加载此技能后:
templates/ 目录中的模板作为起点examples/ 中的示例以获取特定任务的实现关键提醒:
此技能专为专家级实现而设计。初学者应在尝试 GRPO 之前从监督微调开始。
每周安装
142
仓库
GitHub 星标
22.6K
首次出现
Jan 21, 2026
安全审计
安装于
claude-code115
opencode113
cursor105
gemini-cli104
codex94
antigravity90
Expert-level guidance for implementing Group Relative Policy Optimization (GRPO) using the Transformer Reinforcement Learning (TRL) library. This skill provides battle-tested patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions.
Use GRPO training when you need to:
Do NOT use GRPO for:
Key Mechanism:
Critical Difference from PPO:
Mathematical Intuition:
For each prompt p:
1. Generate N completions: {c₁, c₂, ..., cₙ}
2. Compute rewards: {r₁, r₂, ..., rₙ}
3. Learn to increase probability of high-reward completions
relative to low-reward ones in the same group
Golden Rules:
Reward Function Types:
| Type | Use Case | Example Weight |
|---|---|---|
| Correctness | Verifiable tasks (math, code) | 2.0 (highest) |
| Format | Strict structure enforcement | 0.5-1.0 |
| Length | Encourage verbosity/conciseness | 0.1-0.5 |
| Style | Penalize unwanted patterns | -0.5 to 0.5 |
Critical Requirements:
Example Structure:
from datasets import load_dataset, Dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
[Your step-by-step thinking]
</reasoning>
<answer>
[Final answer]
</answer>
"""
def prepare_dataset(raw_data):
"""
Transform raw data into GRPO-compatible format.
Returns: Dataset with columns:
- 'prompt': List[Dict] with role/content (system + user messages)
- 'answer': str (ground truth, optional but recommended)
"""
return raw_data.map(lambda x: {
'prompt': [
{'role': 'system', 'content': SYSTEM_PROMPT},
{'role': 'user', 'content': x['question']}
],
'answer': extract_answer(x['raw_answer'])
})
Pro Tips:
Template Structure:
def reward_function_name(
prompts, # List[List[Dict]]: Original prompts
completions, # List[List[Dict]]: Model generations
answer=None, # Optional: Ground truth from dataset
**kwargs # Additional dataset columns
) -> list[float]:
"""
Evaluate completions and return rewards.
Returns: List of floats (one per completion)
"""
# Extract completion text
responses = [comp[0]['content'] for comp in completions]
# Compute rewards
rewards = []
for response in responses:
score = compute_score(response)
rewards.append(score)
return rewards
Example 1: Correctness Reward (Math/Coding)
def correctness_reward(prompts, completions, answer, **kwargs):
"""Reward correct answers with high score."""
responses = [comp[0]['content'] for comp in completions]
extracted = [extract_final_answer(r) for r in responses]
return [2.0 if ans == gt else 0.0
for ans, gt in zip(extracted, answer)]
Example 2: Format Reward (Structured Output)
import re
def format_reward(completions, **kwargs):
"""Reward XML-like structured format."""
pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>'
responses = [comp[0]['content'] for comp in completions]
return [1.0 if re.search(pattern, r, re.DOTALL) else 0.0
for r in responses]
Example 3: Incremental Format Reward (Partial Credit)
def incremental_format_reward(completions, **kwargs):
"""Award partial credit for format compliance."""
responses = [comp[0]['content'] for comp in completions]
rewards = []
for r in responses:
score = 0.0
if '<reasoning>' in r:
score += 0.25
if '</reasoning>' in r:
score += 0.25
if '<answer>' in r:
score += 0.25
if '</answer>' in r:
score += 0.25
# Penalize extra text after closing tag
if r.count('</answer>') == 1:
extra_text = r.split('</answer>')[-1].strip()
score -= len(extra_text) * 0.001
rewards.append(score)
return rewards
Critical Insight: Combine 3-5 reward functions for robust training. Order matters less than diversity of signals.
Memory-Optimized Config (Small GPU)
from trl import GRPOConfig
training_args = GRPOConfig(
output_dir="outputs/grpo-model",
# Learning rate
learning_rate=5e-6, # Lower = more stable
adam_beta1=0.9,
adam_beta2=0.99,
weight_decay=0.1,
warmup_ratio=0.1,
lr_scheduler_type='cosine',
# Batch settings
per_device_train_batch_size=1,
gradient_accumulation_steps=4, # Effective batch = 4
# GRPO-specific
num_generations=8, # Group size: 8-16 recommended
max_prompt_length=256,
max_completion_length=512,
# Training duration
num_train_epochs=1,
max_steps=None, # Or set fixed steps (e.g., 500)
# Optimization
bf16=True, # Faster on A100/H100
optim="adamw_8bit", # Memory-efficient optimizer
max_grad_norm=0.1,
# Logging
logging_steps=1,
save_steps=100,
report_to="wandb", # Or "none" for no logging
)
High-Performance Config (Large GPU)
training_args = GRPOConfig(
output_dir="outputs/grpo-model",
learning_rate=1e-5,
per_device_train_batch_size=4,
gradient_accumulation_steps=2,
num_generations=16, # Larger groups = better signal
max_prompt_length=512,
max_completion_length=1024,
num_train_epochs=1,
bf16=True,
use_vllm=True, # Fast generation with vLLM
logging_steps=10,
)
Critical Hyperparameters:
| Parameter | Impact | Tuning Advice |
|---|---|---|
num_generations | Group size for comparison | Start with 8, increase to 16 if GPU allows |
learning_rate | Convergence speed/stability | 5e-6 (safe), 1e-5 (faster, riskier) |
max_completion_length | Output verbosity | Match your task (512 for reasoning, 256 for short answers) |
gradient_accumulation_steps | Effective batch size | Increase if GPU memory limited |
Standard Setup (Transformers)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import GRPOTrainer
# Load model
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2", # 2-3x faster
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# Optional: LoRA for parameter-efficient training
peft_config = LoraConfig(
r=16, # Rank (higher = more capacity)
lora_alpha=32, # Scaling factor (typically 2*r)
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
],
task_type="CAUSAL_LM",
lora_dropout=0.05,
)
# Initialize trainer
trainer = GRPOTrainer(
model=model,
processing_class=tokenizer,
reward_funcs=[
incremental_format_reward,
format_reward,
correctness_reward,
],
args=training_args,
train_dataset=dataset,
peft_config=peft_config, # Remove for full fine-tuning
)
# Train
trainer.train()
# Save
trainer.save_model("final_model")
Unsloth Setup (2-3x Faster)
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="google/gemma-3-1b-it",
max_seq_length=1024,
load_in_4bit=True,
fast_inference=True,
max_lora_rank=32,
)
model = FastLanguageModel.get_peft_model(
model,
r=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=32,
use_gradient_checkpointing="unsloth",
)
# Rest is identical to standard setup
trainer = GRPOTrainer(model=model, ...)
trainer.train()
Key metrics to watch:
reward: Average across all completionsreward_std: Diversity within groups (should remain > 0)kl: KL divergence from reference (should grow moderately)Healthy Training Pattern:
Step Reward Reward_Std KL
100 0.5 0.3 0.02
200 0.8 0.25 0.05
300 1.2 0.2 0.08 ← Good progression
400 1.5 0.15 0.12
Warning Signs:
| Problem | Symptom | Solution |
|---|---|---|
| Mode collapse | All completions identical | Increase num_generations, add diversity penalty |
| No learning | Flat rewards | Check reward function logic, increase LR |
| OOM errors | GPU memory exceeded | Reduce num_generations, enable gradient checkpointing |
| Slow training | < 1 it/s | Enable use_vllm=True, use Unsloth, reduce seq length |
| Format ignored | Model doesn't follow structure | Increase format reward weight, add incremental rewards |
For complex tasks, train in stages:
# Stage 1: Format compliance (epochs=1)
trainer_stage1 = GRPOTrainer(
model=model,
reward_funcs=[incremental_format_reward, format_reward],
...
)
trainer_stage1.train()
# Stage 2: Correctness (epochs=1)
trainer_stage2 = GRPOTrainer(
model=model,
reward_funcs=[format_reward, correctness_reward],
...
)
trainer_stage2.train()
class AdaptiveReward:
def __init__(self, base_reward_func, initial_weight=1.0):
self.func = base_reward_func
self.weight = initial_weight
def __call__(self, *args, **kwargs):
rewards = self.func(*args, **kwargs)
return [r * self.weight for r in rewards]
def adjust_weight(self, success_rate):
"""Increase weight if model struggling, decrease if succeeding."""
if success_rate < 0.3:
self.weight *= 1.2
elif success_rate > 0.8:
self.weight *= 0.9
def load_custom_knowledge_base(csv_path):
"""Example: School communication platform docs."""
import pandas as pd
df = pd.read_csv(csv_path)
dataset = Dataset.from_pandas(df).map(lambda x: {
'prompt': [
{'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT},
{'role': 'user', 'content': x['question']}
],
'answer': x['expert_answer']
})
return dataset
# Merge LoRA adapters into base model
if hasattr(trainer.model, 'merge_and_unload'):
merged_model = trainer.model.merge_and_unload()
merged_model.save_pretrained("production_model")
tokenizer.save_pretrained("production_model")
from transformers import pipeline
generator = pipeline(
"text-generation",
model="production_model",
tokenizer=tokenizer
)
result = generator(
[
{'role': 'system', 'content': SYSTEM_PROMPT},
{'role': 'user', 'content': "What is 15 + 27?"}
],
max_new_tokens=256,
do_sample=True,
temperature=0.7,
top_p=0.9
)
print(result[0]['generated_text'])
Before Training:
During Training:
After Training:
# Debug reward function
def debug_reward(completions, **kwargs):
responses = [comp[0]['content'] for comp in completions]
for i, r in enumerate(responses[:2]): # Print first 2
print(f"Response {i}: {r[:200]}...")
return [1.0] * len(responses) # Dummy rewards
# Test without training
trainer = GRPOTrainer(..., reward_funcs=[debug_reward])
trainer.generate_completions(dataset[:1]) # Generate without updating
Official Documentation:
Example Repositories:
Recommended Reading:
When this skill is loaded:
templates/ directory as starting pointsexamples/ for task-specific implementationsCritical Reminders:
This skill is designed for expert-level implementation. Beginners should start with supervised fine-tuning before attempting GRPO.
Weekly Installs
142
Repository
GitHub Stars
22.6K
First Seen
Jan 21, 2026
Security Audits
Gen Agent Trust HubWarnSocketPassSnykFail
Installed on
claude-code115
opencode113
cursor105
gemini-cli104
codex94
antigravity90
超能力技能使用指南:AI助手技能调用优先级与工作流程详解
46,500 周安装