verl-rl-training by orchestra-research/ai-research-skills
npx skills add https://github.com/orchestra-research/ai-research-skills --skill verl-rl-training
verl, from ByteDance's Seed team, is a flexible, efficient, and production-ready reinforcement learning (RL) training library for large language models. It implements the HybridFlow framework (EuroSys 2025) and powers models such as Doubao-1.5-pro, which achieves O1-level performance on math benchmarks.
Choose verl when you need:
Consider alternatives when:
# Option 1: pip install
pip install verl[vllm] # or verl[sglang] for SGLang backend
# Option 2: Docker (recommended for production)
docker pull verlai/verl:vllm011.latest
# Option 3: From source
git clone https://github.com/volcengine/verl.git
cd verl && pip install -e .[vllm,math]
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=~/data/gsm8k/train.parquet \
actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
actor_rollout_ref.rollout.n=8 \
actor_rollout_ref.actor.use_kl_loss=True \
trainer.n_gpus_per_node=8
verl uses a HybridFlow programming model separating control flow from computation:
┌─────────────────────────────────────────────────────────┐
│ Single-Process Controller (Ray) │
│ - Orchestrates: rollout → reward → train → sync │
└─────────────────────┬───────────────────────────────────┘
│
┌─────────────────────▼───────────────────────────────────┐
│ Multi-Process Workers │
│ ├── ActorRolloutRefWorker (policy + generation) │
│ ├── CriticWorker (value estimation, PPO only) │
│ └── RewardManager (model-based or rule-based rewards) │
└─────────────────────────────────────────────────────────┘
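The controller/worker split above can be sketched in plain Python. This is an illustrative mock of the orchestration order only; the class and method names here are hypothetical stand-ins, and verl's real workers are Ray actors driven by the single-process controller:

```python
# Illustrative mock of the HybridFlow single-controller loop (hypothetical
# names, not verl's API). It shows only the orchestration order:
# rollout -> reward -> train -> weight sync.

class ActorRolloutWorker:
    def __init__(self):
        self.weights_version = 0

    def generate(self, prompts):
        # Stand-in for vLLM/SGLang generation.
        return [f"response to: {p}" for p in prompts]

    def train_step(self, prompts, responses, rewards):
        # Stand-in for a PPO/GRPO policy update.
        self.weights_version += 1

    def sync_weights(self):
        # Stand-in for pushing updated weights to the rollout engine.
        return self.weights_version

class RewardManager:
    def score(self, responses):
        # Rule-based stand-in reward: 1.0 for any non-empty response.
        return [1.0 if r else 0.0 for r in responses]

def controller_step(worker, reward_mgr, prompts):
    responses = worker.generate(prompts)         # 1. rollout
    rewards = reward_mgr.score(responses)        # 2. reward
    worker.train_step(prompts, responses, rewards)  # 3. train
    return worker.sync_weights()                 # 4. sync
```

In the real system the controller issues these four calls over Ray RPC, so the control flow stays in one process while generation and training run on many GPUs.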
Use this workflow for training reasoning models on math tasks like GSM8K or MATH.
Create a parquet dataset with prompt and reward_model columns:
import pandas as pd
data = [
{
"prompt": [{"role": "user", "content": "What is 15 + 27?"}],
"reward_model": {"ground_truth": "42"}
},
# ... more examples
]
df = pd.DataFrame(data)
df.to_parquet("train.parquet")
# reward_function.py
import re
def compute_reward(responses, ground_truths):
rewards = []
for response, gt in zip(responses, ground_truths):
# Extract answer from response
match = re.search(r'\\boxed{([^}]+)}', response)
if match and match.group(1).strip() == gt.strip():
rewards.append(1.0)
else:
rewards.append(0.0)
return rewards
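A quick sanity check of the rule-based reward above, using made-up sample responses (the function is repeated here so the snippet runs standalone):

```python
import re

def compute_reward(responses, ground_truths):
    # Same rule as above: 1.0 on an exact \boxed{...} match, else 0.0.
    rewards = []
    for response, gt in zip(responses, ground_truths):
        match = re.search(r'\\boxed{([^}]+)}', response)
        rewards.append(1.0 if match and match.group(1).strip() == gt.strip() else 0.0)
    return rewards

# Made-up sample responses:
responses = [
    r"15 + 27 = 42, so the answer is \boxed{42}",
    r"The answer is \boxed{41}",
    "I am not sure.",
]
print(compute_reward(responses, ["42", "42", "42"]))  # → [1.0, 0.0, 0.0]
```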
# config/grpo_math.yaml
algorithm:
adv_estimator: grpo
gamma: 1.0
lam: 1.0
data:
train_files: /path/to/train.parquet
val_files: /path/to/val.parquet
train_batch_size: 256
max_prompt_length: 512
max_response_length: 2048
actor_rollout_ref:
model:
path: Qwen/Qwen2.5-7B-Instruct
actor:
use_kl_loss: true
kl_loss_coef: 0.001
ppo_mini_batch_size: 64
rollout:
name: vllm
n: 8 # samples per prompt
temperature: 0.7
top_p: 0.95
trainer:
total_epochs: 3
n_gpus_per_node: 8
save_freq: 100
python3 -m verl.trainer.main_ppo \
--config-path config \
--config-name grpo_math \
trainer.experiment_name=grpo_math_qwen7b
Use this workflow when you need value-based advantage estimation (GAE).
algorithm:
adv_estimator: gae # Use GAE instead of GRPO
gamma: 0.99
lam: 0.95
critic:
model:
path: Qwen/Qwen2.5-7B-Instruct # Can be same or different from actor
ppo_mini_batch_size: 64
actor_rollout_ref:
actor:
use_kl_loss: true
kl_loss_coef: 0.02
clip_ratio: 0.2 # PPO clipping
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=gae \
critic.model.path=Qwen/Qwen2.5-7B-Instruct \
trainer.n_gpus_per_node=8
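For reference, GAE with the gamma/lam values above computes advantages from TD residuals via a backward recursion. A minimal sketch (illustrative, not verl's implementation):

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: per-step rewards r_t (length T)
    values:  critic estimates V(s_t), length T+1 (last entry is the
             bootstrap value V(s_T))
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + gamma * lam * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

With gamma=1 and lam=1 this reduces to Monte Carlo returns minus the value baseline; lam=0.95 trades a little bias for lower variance.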
Use this workflow for models >70B parameters or when you need expert parallelism.
pip install mbridge
actor_rollout_ref:
model:
path: /path/to/megatron/checkpoint
backend: megatron
actor:
strategy: megatron
tensor_model_parallel_size: 8
pipeline_model_parallel_size: 2
rollout:
name: vllm
tensor_parallel_size: 8
# On head node
ray start --head --port=6379
# On worker nodes
ray start --address='head_ip:6379'
# Launch training
python3 -m verl.trainer.main_ppo \
trainer.nnodes=4 \
trainer.n_gpus_per_node=8
| Algorithm | adv_estimator | Use Case |
|---|---|---|
| GRPO | grpo | Critic-free, math/reasoning |
| PPO/GAE | gae | Dense rewards, value estimation |
| REINFORCE++ | reinforce_plus_plus | Variance reduction |
| RLOO | rloo | Leave-one-out baseline |
| ReMax | remax | Maximum reward baseline |
| OPO | opo | Optimal policy optimization |
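GRPO is critic-free because it normalizes rewards within each group of n samples drawn for the same prompt. A minimal sketch of that group-relative advantage computation (illustrative; verl's version also adds a small epsilon for numerical stability):

```python
import statistics

def grpo_advantages(group_rewards):
    """Group-relative advantages: (r - mean) / std over the n rollout
    samples for one prompt. Illustrative sketch, not verl's code."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    if std == 0.0:
        # All samples scored the same: no learning signal for this group.
        return [0.0] * len(group_rewards)
    return [(r - mean) / std for r in group_rewards]
```

Samples that beat their group's average get positive advantages; below-average samples get negative ones, which is why no learned value function is needed.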
# Rollout parameters
actor_rollout_ref.rollout.n: 8 # Samples per prompt
actor_rollout_ref.rollout.temperature: 0.7 # Sampling temperature
actor_rollout_ref.rollout.top_p: 0.95 # Nucleus sampling
# Training parameters
actor_rollout_ref.actor.lr: 1e-6 # Learning rate
actor_rollout_ref.actor.ppo_mini_batch_size: 64
actor_rollout_ref.actor.clip_ratio: 0.2 # PPO clip range
# KL control
actor_rollout_ref.actor.use_kl_loss: true
actor_rollout_ref.actor.kl_loss_coef: 0.001
algorithm.kl_ctrl.target_kl: 0.1 # For adaptive KL control
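target_kl drives an adaptive controller that raises the KL coefficient when the measured policy KL overshoots the target and lowers it when it undershoots. A sketch of the standard PPO-style adaptive-KL rule (assumed behavior, not verl's exact code):

```python
class AdaptiveKLController:
    """Standard PPO-style adaptive KL control: nudge the coefficient so
    the measured policy KL tracks target_kl. Illustrative sketch."""

    def __init__(self, init_coef=0.001, target_kl=0.1, horizon=10000):
        self.coef = init_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        # Proportional error, clipped to [-0.2, 0.2] for stability.
        error = min(max(current_kl / self.target_kl - 1.0, -0.2), 0.2)
        self.coef *= 1.0 + error * n_steps / self.horizon
        return self.coef
```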
Symptoms: CUDA out of memory during the generation phase
Solutions:
# Reduce batch size
actor_rollout_ref.rollout.log_prob_micro_batch_size: 4
# Enable gradient checkpointing
actor_rollout_ref.model.enable_gradient_checkpointing: true
# Use FSDP2 with CPU offloading
actor_rollout_ref.actor.strategy: fsdp2
actor_rollout_ref.actor.fsdp_config.offload_policy: true
Symptoms: Loss spikes, reward collapse
Solutions:
# Reduce learning rate
actor_rollout_ref.actor.lr: 5e-7
# Increase KL penalty
actor_rollout_ref.actor.kl_loss_coef: 0.01
# Enable gradient clipping
actor_rollout_ref.actor.max_grad_norm: 1.0
Symptoms: Long pauses between rollout and training
Solutions:
# Use FSDP2 for faster resharding
actor_rollout_ref.actor.strategy=fsdp2
# Enable async weight transfer
trainer.async_weight_update=true
Symptoms: Import errors or generation failures
Solution: Use compatible versions:
pip install "vllm>=0.8.5,<=0.12.0"
# Avoid vLLM 0.7.x (known bugs)
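A small helper to check whether an installed vLLM version string falls in the known-good range (hypothetical convenience code, not part of verl):

```python
def vllm_version_ok(ver: str) -> bool:
    """True if ver is within the compatible range 0.8.5 .. 0.12.0
    (this also rejects the buggy 0.7.x series)."""
    parts = tuple(int(p) for p in ver.split(".")[:3])
    return (0, 8, 5) <= parts <= (0, 12, 0)

# Usage, guarded so the snippet works even when vllm is not installed:
try:
    from importlib.metadata import version
    print(vllm_version_ok(version("vllm")))
except Exception:
    pass
```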
See references/multi-turn.md for agentic workflows with tool use.
actor_rollout_ref:
model:
path: Qwen/Qwen2.5-VL-7B-Instruct
rollout:
name: vllm
enable_vision: true
actor_rollout_ref:
actor:
lora:
enabled: true
r: 16
alpha: 32
target_modules: ["q_proj", "v_proj"]
Weekly Installs: 69
GitHub Stars: 5.6K
First Seen: Feb 7, 2026
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on: codex (60), opencode (60), cursor (60), gemini-cli (59), github-copilot (58), claude-code (57)