
verl-rl-training: ByteDance's open-source reinforcement learning library for large language models, supporting PPO, GRPO, SPIN, and other algorithms

verl-rl-training by davila7/claude-code-templates

70 weekly installs

23,400 GitHub Stars

GitHub

Install command

npx skills add https://github.com/davila7/claude-code-templates --skill verl-rl-training

AI/Machine Learning PyTorch Open Source

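🇨🇳 Chinese Introduction

verl: Volcano Engine Reinforcement Learning for LLMs

verl is a flexible, efficient, and production-ready RL training library for large language models, developed by ByteDance's Seed team. It implements the HybridFlow framework (EuroSys 2025) and powers models such as Doubao-1.5-pro, which reach O1-level performance on math benchmarks.

When to Use verl

Choose verl when you need:

Production-ready RL training at scale (tested up to 671B parameters)
Flexibility to swap backends (FSDP ↔ Megatron-LM ↔ vLLM ↔ SGLang)
Support for multiple RL algorithms (PPO, GRPO, RLOO, REINFORCE++, DAPO)
Multi-turn rollout with tool calling for agentic workflows
Vision-language model RL training

Consider alternatives when:

You need Megatron-native training → use slime or miles
You want PyTorch-native abstractions with Monarch → use torchforge
You only need simple SFT/DPO → use TRL or Axolotl

Key Features

Training backends: FSDP, FSDP2, Megatron-LM
Rollout engines: vLLM, SGLang, HuggingFace Transformers
Algorithms: PPO, GRPO, DAPO, RLOO, ReMax, REINFORCE++, SPIN, SPPO
Models: Qwen-3, Llama-3.1, DeepSeek, Gemma-2 (0.5B to 671B)
Advanced: LoRA RL, sequence parallelism, expert parallelism, multi-turn tools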

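Installation

# Option 1: pip install (quotes keep the shell from interpreting the brackets)
pip install "verl[vllm]"  # or "verl[sglang]" for the SGLang backend

# Option 2: Docker (recommended for production)
docker pull verlai/verl:vllm011.latest

# Option 3: From source
git clone https://github.com/volcengine/verl.git
cd verl && pip install -e ".[vllm,math]"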


Quick Start: GRPO Training

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=~/data/gsm8k/train.parquet \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.actor.use_kl_loss=True \
    trainer.n_gpus_per_node=8

Core Architecture

verl uses the HybridFlow programming model, which separates control flow from computation:

┌─────────────────────────────────────────────────────────┐
│ Single-Process Controller (Ray)                         │
│ - Orchestrates: rollout → reward → train → sync         │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Multi-Process Workers                                   │
│ ├── ActorRolloutRefWorker (policy + generation)         │
│ ├── CriticWorker (value estimation, PPO only)           │
│ └── RewardManager (model-based or rule-based rewards)   │
└─────────────────────────────────────────────────────────┘

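Workflow 1: Math Reasoning with GRPO

Use this workflow to train reasoning models on math tasks such as GSM8K or MATH.

Prerequisites Checklist

GPU cluster with 8+ GPUs (H100 recommended)
Dataset in parquet format with prompt and reward_model columns
Base model from HuggingFace Hub

Step 1: Prepare Dataset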

import pandas as pd

data = [
    {
        "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
        "reward_model": {"ground_truth": "42"}
    },
    # ... more examples
]
df = pd.DataFrame(data)
df.to_parquet("train.parquet")

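Step 2: Define Reward Function

# reward_function.py
import re

def compute_reward(responses, ground_truths):
    """Return 1.0 for each response whose \\boxed{...} answer equals the ground truth, else 0.0."""
    rewards = []
    for response, gt in zip(responses, ground_truths):
        # Extract the first \boxed{...} answer; braces are escaped so they match literally
        match = re.search(r'\\boxed\{([^}]+)\}', response)
        if match and match.group(1).strip() == gt.strip():
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards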

Step 3: Create Training Config

# config/grpo_math.yaml
algorithm:
  adv_estimator: grpo
  gamma: 1.0
  lam: 1.0

data:
  train_files: /path/to/train.parquet
  val_files: /path/to/val.parquet
  train_batch_size: 256
  max_prompt_length: 512
  max_response_length: 2048

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.001
    ppo_mini_batch_size: 64
  rollout:
    name: vllm
    n: 8  # samples per prompt
    temperature: 0.7
    top_p: 0.95

trainer:
  total_epochs: 3
  n_gpus_per_node: 8
  save_freq: 100

Step 4: Launch Training

python3 -m verl.trainer.main_ppo \
    --config-path config \
    --config-name grpo_math \
    trainer.experiment_name=grpo_math_qwen7b

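Step 5: Monitor and Validate

Check WandB/TensorBoard for loss curves
Verify that the reward increases over training steps
Run evaluation on a held-out test set

Workflow 2: PPO with Critic Model

Use this workflow when you need value-based advantage estimation (GAE).

Key Differences from GRPO

Requires a separate critic model
Uses Generalized Advantage Estimation (GAE)
Better suited to tasks with dense rewards

Configuration

algorithm:
  adv_estimator: gae  # Use GAE instead of GRPO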
  gamma: 0.99
  lam: 0.95

critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct  # Can be the same as the actor or different
  ppo_mini_batch_size: 64

actor_rollout_ref:
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.02
    clip_ratio: 0.2  # PPO clipping

Launch with Critic

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    critic.model.path=Qwen/Qwen2.5-7B-Instruct \
    trainer.n_gpus_per_node=8

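Workflow 3: Large-Scale Training with Megatron

Use this workflow for models with more than 70B parameters or when you need expert parallelism.

Prerequisites

Install the Megatron-LM bridge: pip install mbridge
Convert the model to Megatron format
Multi-node cluster with NVLink/InfiniBand

Configuration for 70B+ Models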

actor_rollout_ref:
  model:
    path: /path/to/megatron/checkpoint
    backend: megatron
  actor:
    strategy: megatron
    tensor_model_parallel_size: 8
    pipeline_model_parallel_size: 2
  rollout:
    name: vllm
    tensor_model_parallel_size: 8

Launch Multi-Node

# On the head node
ray start --head --port=6379

# On worker nodes (replace head_ip with the head node's address)
ray start --address='head_ip:6379'

# Launch training
python3 -m verl.trainer.main_ppo \
    trainer.nnodes=4 \
    trainer.n_gpus_per_node=8

Configuration Reference

Algorithm Selection

Algorithm	`adv_estimator`	Use Case
GRPO	`grpo`	Critic-free, math/reasoning
PPO/GAE	`gae`	Dense rewards, value estimation
REINFORCE++	`reinforce_plus_plus`	Variance reduction
RLOO	`rloo`	Leave-one-out baseline
ReMax	`remax`	Greedy-decoding reward baseline
OPO	`opo`	Optimal reward baseline

Key Parameters

# Rollout parameters
actor_rollout_ref.rollout.n: 8              # Samples per prompt
actor_rollout_ref.rollout.temperature: 0.7  # Sampling temperature
actor_rollout_ref.rollout.top_p: 0.95       # Nucleus sampling

# Training parameters
actor_rollout_ref.actor.optim.lr: 1e-6      # Learning rate
actor_rollout_ref.actor.ppo_mini_batch_size: 64
actor_rollout_ref.actor.clip_ratio: 0.2     # PPO clip range

# KL control
actor_rollout_ref.actor.use_kl_loss: true
actor_rollout_ref.actor.kl_loss_coef: 0.001
algorithm.kl_ctrl.target_kl: 0.1            # For adaptive KL control

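Common Issues and Solutions

Issue: OOM During Rollout

Symptoms: CUDA out of memory during the generation phase

Solutions:

# Reduce batch size
actor_rollout_ref.rollout.log_prob_micro_batch_size: 4

# Enable gradient checkpointing
actor_rollout_ref.model.enable_gradient_checkpointing: true

# Use FSDP2 with CPU offloading
actor_rollout_ref.actor.strategy: fsdp2
actor_rollout_ref.actor.fsdp_config.offload_policy: true

Issue: Training Instability

Symptoms: Loss spikes, reward collapse

Solutions:

# Reduce learning rate
actor_rollout_ref.actor.optim.lr: 5e-7

# Increase KL penalty
actor_rollout_ref.actor.kl_loss_coef: 0.01

# Enable gradient clipping
actor_rollout_ref.actor.grad_clip: 1.0

Issue: Slow Weight Sync

Symptoms: Long pauses between rollout and training

Solutions:

# Use FSDP2 for faster resharding
actor_rollout_ref.actor.strategy=fsdp2

# Enable async weight transfer
trainer.async_weight_update=true

Issue: vLLM Version Mismatch

Symptoms: Import errors or generation failures

Solution: Use compatible versions:

# Quote the requirement so the shell does not treat > and < as redirections
pip install "vllm>=0.8.5,<=0.12.0"
# Avoid vLLM 0.7.x (known bugs)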

Advanced Topics

Multi-Turn Tool Calling

See references/multi-turn.md for agentic workflows with tool use.

Vision-Language Models

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-VL-7B-Instruct
  rollout:
    name: vllm
    enable_vision: true

LoRA Training

actor_rollout_ref:
  actor:
    lora:
      enabled: true
      r: 16
      alpha: 32
      target_modules: ["q_proj", "v_proj"]

Resources

Documentation: https://verl.readthedocs.io/
Paper: https://arxiv.org/abs/2409.19256
GitHub: https://github.com/volcengine/verl
Recipes: https://github.com/verl-project/verl-recipe (DAPO, GSPO, etc.)
Community: verl-project on Slack

🇺🇸English

verl: Volcano Engine Reinforcement Learning for LLMs

verl is a flexible, efficient, and production-ready RL training library for large language models from ByteDance's Seed team. It implements the HybridFlow framework (EuroSys 2025) and powers models like Doubao-1.5-pro achieving O1-level performance on math benchmarks.

When to Use verl

Choose verl when you need:

Production-ready RL training at scale (tested up to 671B parameters)
Flexibility to swap backends (FSDP ↔ Megatron-LM ↔ vLLM ↔ SGLang)
Support for multiple RL algorithms (PPO, GRPO, RLOO, REINFORCE++, DAPO)
Multi-turn rollout with tool calling for agentic workflows
Vision-language model RL training

Consider alternatives when:

You need Megatron-native training → use slime or miles
You want PyTorch-native abstractions with Monarch → use torchforge
You only need simple SFT/DPO → use TRL or Axolotl

Key Features

Training backends: FSDP, FSDP2, Megatron-LM
Rollout engines: vLLM, SGLang, HuggingFace Transformers
Algorithms: PPO, GRPO, DAPO, RLOO, ReMax, REINFORCE++, SPIN, SPPO
Models: Qwen-3, Llama-3.1, DeepSeek, Gemma-2 (0.5B to 671B)
Advanced: LoRA RL, sequence parallelism, expert parallelism, multi-turn tools

Installation

# Option 1: pip install
pip install "verl[vllm]"  # or "verl[sglang]" for the SGLang backend; quotes keep the shell from globbing the brackets

# Option 2: Docker (recommended for production)
docker pull verlai/verl:vllm011.latest

# Option 3: From source
git clone https://github.com/volcengine/verl.git
cd verl && pip install -e ".[vllm,math]"

Quick Start: GRPO Training

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=~/data/gsm8k/train.parquet \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.actor.use_kl_loss=True \
    trainer.n_gpus_per_node=8

Core Architecture

verl uses a HybridFlow programming model separating control flow from computation:

┌─────────────────────────────────────────────────────────┐
│ Single-Process Controller (Ray)                         │
│ - Orchestrates: rollout → reward → train → sync        │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Multi-Process Workers                                   │
│ ├── ActorRolloutRefWorker (policy + generation)        │
│ ├── CriticWorker (value estimation, PPO only)          │
│ └── RewardManager (model-based or rule-based rewards)  │
└─────────────────────────────────────────────────────────┘
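
The control flow in the diagram can be sketched as a plain loop. This is a toy illustration only: `ToyWorkers` and `run_controller` are hypothetical names, not verl classes, and the real workers run as distributed Ray processes rather than in-process methods.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWorkers:
    """Hypothetical stand-in for the worker groups (not verl's API)."""
    policy_version: int = 0
    rollout_version: int = 0
    log: list = field(default_factory=list)

    def rollout(self, prompts):
        self.log.append("rollout")
        return [f"{p} -> answer(v{self.rollout_version})" for p in prompts]

    def reward(self, responses):
        self.log.append("reward")
        return [1.0 if "answer" in r else 0.0 for r in responses]

    def train(self, responses, rewards):
        self.log.append("train")
        self.policy_version += 1  # pretend one optimizer step happened

    def sync(self):
        self.log.append("sync")
        self.rollout_version = self.policy_version  # push new weights to rollout


def run_controller(workers, prompts, steps):
    """Single-process control flow; the heavy work is delegated to workers."""
    for _ in range(steps):
        responses = workers.rollout(prompts)
        rewards = workers.reward(responses)
        workers.train(responses, rewards)
        workers.sync()
    return workers
```

The point of the split is that the loop stays readable and algorithm-specific, while each worker method can hide an arbitrary parallel backend.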

Workflow 1: Math Reasoning with GRPO

Use this workflow for training reasoning models on math tasks like GSM8K or MATH.

Prerequisites Checklist

GPU cluster with 8+ GPUs (H100 recommended)
Dataset in parquet format with prompt and reward_model columns
Base model from HuggingFace Hub

Step 1: Prepare Dataset

import pandas as pd

data = [
    {
        "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
        "reward_model": {"ground_truth": "42"}
    },
    # ... more examples
]
df = pd.DataFrame(data)
df.to_parquet("train.parquet")
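
Before writing the parquet file, it may help to check each row for the two columns named in the prerequisites. The helper below is a minimal sketch; `validate_record` is a hypothetical name and the checks cover only the fields shown in the example above, not everything a given verl version might expect.

```python
def validate_record(record):
    """Return a list of problems found in one dataset row (empty if it looks OK)."""
    problems = []
    prompt = record.get("prompt")
    if not isinstance(prompt, list) or not prompt:
        problems.append("prompt must be a non-empty list of messages")
    else:
        for i, msg in enumerate(prompt):
            # Each chat message needs both a role and a content field
            if not isinstance(msg, dict) or not {"role", "content"} <= set(msg):
                problems.append(f"prompt[{i}] needs 'role' and 'content'")
    reward_model = record.get("reward_model")
    if not isinstance(reward_model, dict) or "ground_truth" not in reward_model:
        problems.append("reward_model must contain 'ground_truth'")
    return problems
```

Running it over `data` and refusing to write the file if any row returns a non-empty list catches malformed rows before a training job is launched.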

Step 2: Define Reward Function

# reward_function.py
import re

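def compute_reward(responses, ground_truths):
    """Return 1.0 for each response whose \\boxed{...} answer equals the ground truth, else 0.0."""
    rewards = []
    for response, gt in zip(responses, ground_truths):
        # Extract the first \boxed{...} answer; braces are escaped so they match literally
        match = re.search(r'\\boxed\{([^}]+)\}', response)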
        if match and match.group(1).strip() == gt.strip():
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards
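
Exact string matching treats "42" and "42.0" as different answers. A possible variant, sketched here under the assumption that ground truths are mostly numeric, takes the last boxed answer and compares numerically when both sides parse as numbers. The names `extract_boxed`, `answers_match`, and `score` are illustrative and not part of verl.

```python
import re

# Matches \boxed{...} with no nested braces inside
BOXED = re.compile(r"\\boxed\{([^{}]+)\}")


def extract_boxed(response):
    """Return the content of the last \\boxed{...} in the response, or None."""
    matches = BOXED.findall(response)
    return matches[-1].strip() if matches else None


def answers_match(predicted, ground_truth):
    """Compare as numbers when both parse, otherwise as stripped strings."""
    if predicted is None:
        return False
    try:
        return float(predicted.replace(",", "")) == float(ground_truth.replace(",", ""))
    except ValueError:
        return predicted.strip() == ground_truth.strip()


def score(response, ground_truth):
    """Binary reward for a single response."""
    return 1.0 if answers_match(extract_boxed(response), ground_truth) else 0.0
```

Taking the last match rather than the first is a design choice: models often box intermediate results before the final answer.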

Step 3: Create Training Config

# config/grpo_math.yaml
algorithm:
  adv_estimator: grpo
  gamma: 1.0
  lam: 1.0

data:
  train_files: /path/to/train.parquet
  val_files: /path/to/val.parquet
  train_batch_size: 256
  max_prompt_length: 512
  max_response_length: 2048

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.001
    ppo_mini_batch_size: 64
  rollout:
    name: vllm
    n: 8  # samples per prompt
    temperature: 0.7
    top_p: 0.95

trainer:
  total_epochs: 3
  n_gpus_per_node: 8
  save_freq: 100
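
With the `grpo` estimator and `n: 8`, each prompt yields a group of eight sampled responses, and the advantage of each response is its reward normalized against the rest of its group. The sketch below shows that idea in plain Python. It assumes the population standard deviation and a small `eps`; the exact normalization inside verl may differ.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Normalize one prompt's sample rewards to zero mean and (roughly) unit std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; a library may use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two of eight samples solved the problem: they get positive advantage,
# the six failures get a smaller negative one.
example = grpo_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
```

When every sample in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason to sample several responses per prompt.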

Step 4: Launch Training

python3 -m verl.trainer.main_ppo \
    --config-path config \
    --config-name grpo_math \
    trainer.experiment_name=grpo_math_qwen7b
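
The nested keys in the YAML file and the dotted `key=value` overrides on the command line address the same configuration tree. The following sketch flattens a nested dictionary into override strings to make that mapping explicit; `to_overrides` is a hypothetical helper, not something verl or Hydra provides.

```python
def to_overrides(config, prefix=""):
    """Flatten a nested dict into dotted 'a.b.c=value' override strings."""
    overrides = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            overrides.extend(to_overrides(value, path))
        else:
            overrides.append(f"{path}={value}")
    return overrides


config = {
    "algorithm": {"adv_estimator": "grpo"},
    "actor_rollout_ref": {"rollout": {"n": 8}, "actor": {"use_kl_loss": True}},
}
overrides = to_overrides(config)
```

Each resulting string, such as `actor_rollout_ref.rollout.n=8`, has the same shape as the arguments in the Quick Start command.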

Step 5: Monitor and Validate

Check WandB/TensorBoard for loss curves
Verify reward is increasing over steps
Run evaluation on held-out test set

Workflow 2: PPO with Critic Model

Use this workflow when you need value-based advantage estimation (GAE).

Key Differences from GRPO

Requires separate critic model
Uses Generalized Advantage Estimation (GAE)
Better for tasks with dense rewards

Configuration

algorithm:
  adv_estimator: gae  # Use GAE instead of GRPO
  gamma: 0.99
  lam: 0.95

critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct  # Can be same or different from actor
  ppo_mini_batch_size: 64

actor_rollout_ref:
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.02
    clip_ratio: 0.2  # PPO clipping
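
To make the roles of `gamma` and `lam` concrete, here is a minimal pure-Python sketch of Generalized Advantage Estimation for a single trajectory. It assumes the episode terminates after the last step (bootstrap value 0) and is not verl's implementation, which operates on batched token-level tensors.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one trajectory.

    values[t] is the critic's estimate at step t; the value after the
    final step is taken to be 0 (assumption: the episode terminates).
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then an exponentially weighted sum of future errors
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

Lower `lam` leans on the critic more (less variance, more bias); `lam = 1` with `gamma = 1` reduces to the plain return minus the value estimate.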

Launch with Critic

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    critic.model.path=Qwen/Qwen2.5-7B-Instruct \
    trainer.n_gpus_per_node=8

Workflow 3: Large-Scale Training with Megatron

Use this workflow for models >70B parameters or when you need expert parallelism.

Prerequisites

Install Megatron-LM bridge: pip install mbridge
Convert model to Megatron format
Multi-node cluster with NVLink/InfiniBand

Configuration for 70B+ Models

actor_rollout_ref:
  model:
    path: /path/to/megatron/checkpoint
    backend: megatron
  actor:
    strategy: megatron
    tensor_model_parallel_size: 8
    pipeline_model_parallel_size: 2
  rollout:
    name: vllm
    tensor_model_parallel_size: 8

Launch Multi-Node

# On head node
ray start --head --port=6379

# On worker nodes
ray start --address='head_ip:6379'

# Launch training
python3 -m verl.trainer.main_ppo \
    trainer.nnodes=4 \
    trainer.n_gpus_per_node=8
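
The parallel sizes and the cluster shape have to be consistent: the total number of GPUs should factor into tensor-parallel × pipeline-parallel × data-parallel groups. The sketch below checks this for the numbers used above. It ignores expert and context parallelism, and `data_parallel_size` is an illustrative helper rather than a verl function.

```python
def data_parallel_size(nnodes, gpus_per_node, tensor_parallel, pipeline_parallel):
    """Return the data-parallel degree implied by the cluster and model-parallel sizes."""
    world_size = nnodes * gpus_per_node
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by TP*PP = {model_parallel}"
        )
    return world_size // model_parallel


# 4 nodes x 8 GPUs = 32 GPUs; TP=8 and PP=2 use 16 GPUs per model replica
replicas = data_parallel_size(nnodes=4, gpus_per_node=8, tensor_parallel=8, pipeline_parallel=2)
```

With these values there are two model replicas, so the global batch is split across two data-parallel groups.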

Configuration Reference

Algorithm Selection

Algorithm	`adv_estimator`	Use Case
GRPO	`grpo`	Critic-free, math/reasoning
PPO/GAE	`gae`	Dense rewards, value estimation
REINFORCE++	`reinforce_plus_plus`	Variance reduction
RLOO	`rloo`	Leave-one-out baseline
ReMax	`remax`	Greedy-decoding reward baseline
OPO	`opo`	Optimal reward baseline
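
As an illustration of one critic-free estimator from the table, the leave-one-out baseline used by RLOO compares each sample's reward with the mean reward of the other samples for the same prompt. This is a minimal sketch of the idea, not verl's code.

```python
def rloo_advantages(rewards):
    """Advantage of each sample = its reward minus the mean reward of the others."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# One success among four samples
example = rloo_advantages([1.0, 0.0, 0.0, 0.0])
```

Because a sample's own reward is excluded from its baseline, the baseline does not depend on the sample being scored, which keeps the gradient estimate unbiased.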

Key Parameters

# Rollout parameters
actor_rollout_ref.rollout.n: 8              # Samples per prompt
actor_rollout_ref.rollout.temperature: 0.7  # Sampling temperature
actor_rollout_ref.rollout.top_p: 0.95       # Nucleus sampling

# Training parameters
actor_rollout_ref.actor.optim.lr: 1e-6      # Learning rate
actor_rollout_ref.actor.ppo_mini_batch_size: 64
actor_rollout_ref.actor.clip_ratio: 0.2     # PPO clip range

# KL control
actor_rollout_ref.actor.use_kl_loss: true
actor_rollout_ref.actor.kl_loss_coef: 0.001
algorithm.kl_ctrl.target_kl: 0.1            # For adaptive KL control
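
The effect of `clip_ratio` is easiest to see on a single token. The sketch below computes the standard PPO clipped surrogate loss from old and new log-probabilities. It is a scalar illustration with a hypothetical function name; real training code applies the same formula over batched tensors.

```python
import math


def ppo_clip_loss(logp_new, logp_old, advantage, clip_ratio=0.2):
    """Clipped surrogate loss for one token (a quantity to minimize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
    # Take the more pessimistic of the unclipped and clipped objectives
    return -min(ratio * advantage, clipped * advantage)
```

With a positive advantage, a probability ratio above `1 + clip_ratio` earns no extra credit, so a single update cannot push the policy arbitrarily far toward one response.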

Common Issues and Solutions

Issue: OOM During Rollout

Symptoms: CUDA out of memory during generation phase

Solutions:

# Reduce batch size
actor_rollout_ref.rollout.log_prob_micro_batch_size: 4

# Enable gradient checkpointing
actor_rollout_ref.model.enable_gradient_checkpointing: true

# Use FSDP2 with CPU offloading
actor_rollout_ref.actor.strategy: fsdp2
actor_rollout_ref.actor.fsdp_config.offload_policy: true

Issue: Training Instability

Symptoms: Loss spikes, reward collapse

Solutions:

# Reduce learning rate
actor_rollout_ref.actor.optim.lr: 5e-7

# Increase KL penalty
actor_rollout_ref.actor.kl_loss_coef: 0.01

# Enable gradient clipping
actor_rollout_ref.actor.grad_clip: 1.0
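
Gradient clipping with a threshold of 1.0 rescales the whole gradient whenever its global L2 norm exceeds that threshold. The sketch below shows the arithmetic on a flat list of numbers; it is an illustration only, since in practice the framework's own clipping utility does this over all parameter tensors.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads], total


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Logging the pre-clip norm alongside the loss is a cheap way to spot the spikes described in the symptoms.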

Issue: Slow Weight Sync

Symptoms: Long pauses between rollout and training

Solutions:

# Use FSDP2 for faster resharding
actor_rollout_ref.actor.strategy=fsdp2

# Enable async weight transfer
trainer.async_weight_update=true

Issue: vLLM Version Mismatch

Symptoms: Import errors or generation failures

Solution: Use compatible versions:

# Quote the requirement so the shell does not treat > and < as redirections
pip install "vllm>=0.8.5,<=0.12.0"
# Avoid vLLM 0.7.x (known bugs)
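
A launch script can check an installed version string against the range given above before starting a job. This is a minimal sketch with hypothetical helper names; it compares only the leading numeric components and does not handle every version format.

```python
def parse_version(text):
    """Turn '0.8.5' or '0.9.1.post1' into a tuple of its leading integer parts."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def vllm_version_supported(text, low="0.8.5", high="0.12.0"):
    """True if the version falls inside the inclusive range stated in this guide."""
    return parse_version(low) <= parse_version(text) <= parse_version(high)
```

Comparing integer tuples avoids the string-comparison trap where "0.11.0" would sort before "0.8.5".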

Advanced Topics

Multi-Turn Tool Calling

See references/multi-turn.md for agentic workflows with tool use.

Vision-Language Models

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-VL-7B-Instruct
  rollout:
    name: vllm
    enable_vision: true

LoRA Training

actor_rollout_ref:
  actor:
    lora:
      enabled: true
      r: 16
      alpha: 32
      target_modules: ["q_proj", "v_proj"]
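
For a sense of what `r: 16` and `alpha: 32` imply, the sketch below counts the trainable weights LoRA adds to one linear layer and the scaling applied to the low-rank update. The hidden size of 4096 is an assumption chosen for illustration, not a property of any particular model.

```python
def lora_param_count(d_in, d_out, r=16):
    """LoRA adds A (r x d_in) and B (d_out x r): r * (d_in + d_out) weights."""
    return r * (d_in + d_out)


def lora_scaling(alpha=32, r=16):
    """The low-rank update B @ A is multiplied by alpha / r."""
    return alpha / r


hidden = 4096  # assumed hidden size, for illustration only
added = lora_param_count(hidden, hidden, r=16)   # adapter weights for one projection
full = hidden * hidden                           # weights in the frozen projection
fraction = added / full
```

Under these assumptions each adapted projection trains 1/128 of the weights it wraps, which is why LoRA RL fits on far less memory than full fine-tuning.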

Resources

Documentation: https://verl.readthedocs.io/
Paper: https://arxiv.org/abs/2409.19256
GitHub: https://github.com/volcengine/verl
Recipes: https://github.com/verl-project/verl-recipe (DAPO, GSPO, etc.)
Community: verl-project on Slack

Weekly Installs

Repository

davila7/claude-…emplates

GitHub Stars

22.6K

First Seen

Jan 29, 2026

Security Audits

Gen Agent Trust Hub: Pass
Socket: Pass
Snyk: Warn

Installed on

opencode: 57
codex: 55
github-copilot: 54
gemini-cli: 52
cursor: 52
claude-code: 51
