verl-rl-training by orchestra-research/ai-research-skills
npx skills add https://github.com/orchestra-research/ai-research-skills --skill verl-rl-training
verl, from ByteDance's Seed team, is a flexible, efficient, and production-ready reinforcement learning (RL) training library for large language models. It implements the HybridFlow framework (EuroSys 2025) and powers models such as Doubao-1.5-pro, which achieves O1-level performance on math benchmarks.
Choose verl when you need:
Consider alternatives when:
# Option 1: pip install
pip install verl[vllm] # or verl[sglang] for SGLang backend
# Option 2: Docker (recommended for production)
docker pull verlai/verl:vllm011.latest
# Option 3: From source
git clone https://github.com/volcengine/verl.git
cd verl && pip install -e .[vllm,math]
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=~/data/gsm8k/train.parquet \
actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
actor_rollout_ref.rollout.n=8 \
actor_rollout_ref.actor.use_kl_loss=True \
trainer.n_gpus_per_node=8
verl uses a HybridFlow programming model separating control flow from computation:
┌─────────────────────────────────────────────────────────┐
│ Single-Process Controller (Ray) │
│ - Orchestrates: rollout → reward → train → sync │
└─────────────────────┬───────────────────────────────────┘
│
┌─────────────────────▼───────────────────────────────────┐
│ Multi-Process Workers │
│ ├── ActorRolloutRefWorker (policy + generation) │
│ ├── CriticWorker (value estimation, PPO only) │
│ └── RewardManager (model-based or rule-based rewards) │
└─────────────────────────────────────────────────────────┘
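The controller/worker split above can be sketched in plain Python. This is an illustrative mock of the orchestration order only; the class and method names here are hypothetical stand-ins, and verl's real workers are Ray actors driven by the single-process controller:

```python
# Illustrative mock of the HybridFlow single-controller loop (hypothetical
# names, not verl's API). It shows only the orchestration order:
# rollout -> reward -> train -> weight sync.

class ActorRolloutWorker:
    def __init__(self):
        self.weights_version = 0

    def generate(self, prompts):
        # Stand-in for vLLM/SGLang generation.
        return [f"response to: {p}" for p in prompts]

    def train_step(self, prompts, responses, rewards):
        # Stand-in for a PPO/GRPO policy update.
        self.weights_version += 1

    def sync_weights(self):
        # Stand-in for pushing updated weights to the rollout engine.
        return self.weights_version

class RewardManager:
    def score(self, responses):
        # Rule-based stand-in reward: 1.0 for any non-empty response.
        return [1.0 if r else 0.0 for r in responses]

def controller_step(worker, reward_mgr, prompts):
    responses = worker.generate(prompts)         # 1. rollout
    rewards = reward_mgr.score(responses)        # 2. reward
    worker.train_step(prompts, responses, rewards)  # 3. train
    return worker.sync_weights()                 # 4. sync
```

In the real system the controller issues these four calls over Ray RPC, so the control flow stays in one process while generation and training run on many GPUs.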
Use this workflow for training reasoning models on math tasks like GSM8K or MATH.
Create a parquet dataset with prompt and reward_model columns:
import pandas as pd
data = [
{
"prompt": [{"role": "user", "content": "What is 15 + 27?"}],
"reward_model": {"ground_truth": "42"}
},
# ... more examples
]
df = pd.DataFrame(data)
df.to_parquet("train.parquet")
# reward_function.py
import re
def compute_reward(responses, ground_truths):
rewards = []
for response, gt in zip(responses, ground_truths):
# Extract answer from response
match = re.search(r'\\boxed{([^}]+)}', response)
if match and match.group(1).strip() == gt.strip():
rewards.append(1.0)
else:
rewards.append(0.0)
return rewards
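A quick sanity check of the rule-based reward above, using made-up sample responses (the function is repeated here so the snippet runs standalone):

```python
import re

def compute_reward(responses, ground_truths):
    # Same rule as above: 1.0 on an exact \boxed{...} match, else 0.0.
    rewards = []
    for response, gt in zip(responses, ground_truths):
        match = re.search(r'\\boxed{([^}]+)}', response)
        rewards.append(1.0 if match and match.group(1).strip() == gt.strip() else 0.0)
    return rewards

# Made-up sample responses:
responses = [
    r"15 + 27 = 42, so the answer is \boxed{42}",
    r"The answer is \boxed{41}",
    "I am not sure.",
]
print(compute_reward(responses, ["42", "42", "42"]))  # → [1.0, 0.0, 0.0]
```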
# config/grpo_math.yaml
algorithm:
adv_estimator: grpo
gamma: 1.0
lam: 1.0
data:
train_files: /path/to/train.parquet
val_files: /path/to/val.parquet
train_batch_size: 256
max_prompt_length: 512
max_response_length: 2048
actor_rollout_ref:
model:
path: Qwen/Qwen2.5-7B-Instruct
actor:
use_kl_loss: true
kl_loss_coef: 0.001
ppo_mini_batch_size: 64
rollout:
name: vllm
n: 8 # samples per prompt
temperature: 0.7
top_p: 0.95
trainer:
total_epochs: 3
n_gpus_per_node: 8
save_freq: 100
python3 -m verl.trainer.main_ppo \
--config-path config \
--config-name grpo_math \
trainer.experiment_name=grpo_math_qwen7b
Use this workflow when you need value-based advantage estimation (GAE).
algorithm:
adv_estimator: gae # Use GAE instead of GRPO
gamma: 0.99
lam: 0.95
critic:
model:
path: Qwen/Qwen2.5-7B-Instruct # Can be same or different from actor
ppo_mini_batch_size: 64
actor_rollout_ref:
actor:
use_kl_loss: true
kl_loss_coef: 0.02
clip_ratio: 0.2 # PPO clipping
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=gae \
critic.model.path=Qwen/Qwen2.5-7B-Instruct \
trainer.n_gpus_per_node=8
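For reference, GAE with the gamma/lam values above computes advantages from TD residuals via a backward recursion. A minimal sketch (illustrative, not verl's implementation):

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: per-step rewards r_t (length T)
    values:  critic estimates V(s_t), length T+1 (last entry is the
             bootstrap value V(s_T))
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + gamma * lam * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

With gamma=1 and lam=1 this reduces to Monte Carlo returns minus the value baseline; lam=0.95 trades a little bias for lower variance.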
Use this workflow for models >70B parameters or when you need expert parallelism.
pip install mbridge
actor_rollout_ref:
model:
path: /path/to/megatron/checkpoint
backend: megatron
actor:
strategy: megatron
tensor_model_parallel_size: 8
pipeline_model_parallel_size: 2
rollout:
name: vllm
tensor_parallel_size: 8
# On head node
ray start --head --port=6379
# On worker nodes
ray start --address='head_ip:6379'
# Launch training
python3 -m verl.trainer.main_ppo \
trainer.nnodes=4 \
trainer.n_gpus_per_node=8
| Algorithm | adv_estimator | Use Case |
|---|---|---|
| GRPO | grpo | Critic-free, math/reasoning |
| PPO/GAE | gae | Dense rewards, value estimation |
| REINFORCE++ | reinforce_plus_plus | Variance reduction |
| RLOO | rloo | Leave-one-out baseline |
| ReMax | remax | Maximum reward baseline |
| OPO | opo | Optimal policy optimization |
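GRPO is critic-free because it normalizes rewards within each group of n samples drawn for the same prompt. A minimal sketch of that group-relative advantage computation (illustrative; verl's version also adds a small epsilon for numerical stability):

```python
import statistics

def grpo_advantages(group_rewards):
    """Group-relative advantages: (r - mean) / std over the n rollout
    samples for one prompt. Illustrative sketch, not verl's code."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    if std == 0.0:
        # All samples scored the same: no learning signal for this group.
        return [0.0] * len(group_rewards)
    return [(r - mean) / std for r in group_rewards]
```

Samples that beat their group's average get positive advantages; below-average samples get negative ones, which is why no learned value function is needed.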
# Rollout parameters
actor_rollout_ref.rollout.n: 8 # Samples per prompt
actor_rollout_ref.rollout.temperature: 0.7 # Sampling temperature
actor_rollout_ref.rollout.top_p: 0.95 # Nucleus sampling
# Training parameters
actor_rollout_ref.actor.lr: 1e-6 # Learning rate
actor_rollout_ref.actor.ppo_mini_batch_size: 64
actor_rollout_ref.actor.clip_ratio: 0.2 # PPO clip range
# KL control
actor_rollout_ref.actor.use_kl_loss: true
actor_rollout_ref.actor.kl_loss_coef: 0.001
algorithm.kl_ctrl.target_kl: 0.1 # For adaptive KL control
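target_kl drives an adaptive controller that raises the KL coefficient when the measured policy KL overshoots the target and lowers it when it undershoots. A sketch of the standard PPO-style adaptive-KL rule (assumed behavior, not verl's exact code):

```python
class AdaptiveKLController:
    """Standard PPO-style adaptive KL control: nudge the coefficient so
    the measured policy KL tracks target_kl. Illustrative sketch."""

    def __init__(self, init_coef=0.001, target_kl=0.1, horizon=10000):
        self.coef = init_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        # Proportional error, clipped to [-0.2, 0.2] for stability.
        error = min(max(current_kl / self.target_kl - 1.0, -0.2), 0.2)
        self.coef *= 1.0 + error * n_steps / self.horizon
        return self.coef
```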
Symptoms: CUDA out of memory during the generation phase
Solutions:
# Reduce batch size
actor_rollout_ref.rollout.log_prob_micro_batch_size: 4
# Enable gradient checkpointing
actor_rollout_ref.model.enable_gradient_checkpointing: true
# Use FSDP2 with CPU offloading
actor_rollout_ref.actor.strategy: fsdp2
actor_rollout_ref.actor.fsdp_config.offload_policy: true
Symptoms: Loss spikes, reward collapse
Solutions:
# Reduce learning rate
actor_rollout_ref.actor.lr: 5e-7
# Increase KL penalty
actor_rollout_ref.actor.kl_loss_coef: 0.01
# Enable gradient clipping
actor_rollout_ref.actor.max_grad_norm: 1.0
Symptoms: Long pauses between rollout and training
Solutions:
# Use FSDP2 for faster resharding
actor_rollout_ref.actor.strategy=fsdp2
# Enable async weight transfer
trainer.async_weight_update=true
Symptoms: Import errors or generation failures
Solution: Use compatible versions:
pip install "vllm>=0.8.5,<=0.12.0"
# Avoid vLLM 0.7.x (known bugs)
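A small helper to check whether an installed vLLM version string falls in the known-good range (hypothetical convenience code, not part of verl):

```python
def vllm_version_ok(ver: str) -> bool:
    """True if ver is within the compatible range 0.8.5 .. 0.12.0
    (this also rejects the buggy 0.7.x series)."""
    parts = tuple(int(p) for p in ver.split(".")[:3])
    return (0, 8, 5) <= parts <= (0, 12, 0)

# Usage, guarded so the snippet works even when vllm is not installed:
try:
    from importlib.metadata import version
    print(vllm_version_ok(version("vllm")))
except Exception:
    pass
```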
See references/multi-turn.md for agentic workflows with tool use.
actor_rollout_ref:
model:
path: Qwen/Qwen2.5-VL-7B-Instruct
rollout:
name: vllm
enable_vision: true
actor_rollout_ref:
actor:
lora:
enabled: true
r: 16
alpha: 32
target_modules: ["q_proj", "v_proj"]
Weekly Installs: 69
GitHub Stars: 5.6K
First Seen: Feb 7, 2026
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on: codex (60), opencode (60), cursor (60), gemini-cli (59), github-copilot (58), claude-code (57)