openrlhf-training by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill openrlhf-training
OpenRLHF is a Ray-based RLHF framework optimized for distributed training, with vLLM for inference acceleration.
Installation:
# Launch Docker container
docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN \
-v $PWD:/openrlhf nvcr.io/nvidia/pytorch:25.02-py3 bash
# Uninstall conflicts
sudo pip uninstall xgboost transformer_engine flash_attn pynvml -y
# Install OpenRLHF with vLLM
pip install openrlhf[vllm]
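A quick post-install sanity check (a minimal sketch; it assumes both packages expose __version__, so adjust if they do not):
# Confirm the imports resolve and the GPUs are visible inside the container
python3 -c "import openrlhf, vllm; print(openrlhf.__version__, vllm.__version__)"
nvidia-smi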
PPO Training (Hybrid Engine):
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{"working_dir": "/openrlhf"}' \
-- python3 -m openrlhf.cli.train_ppo_ray \
--ref_num_nodes 1 --ref_num_gpus_per_node 8 \
--reward_num_nodes 1 --reward_num_gpus_per_node 8 \
--critic_num_nodes 1 --critic_num_gpus_per_node 8 \
--actor_num_nodes 1 --actor_num_gpus_per_node 8 \
--vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
--colocate_all_models \
--vllm_gpu_memory_utilization 0.5 \
--pretrain OpenRLHF/Llama-3-8b-sft-mixture \
--reward_pretrain OpenRLHF/Llama-3-8b-rm-700k \
--save_path ./output/llama3-8b-rlhf \
--micro_train_batch_size 8 --train_batch_size 128 \
--micro_rollout_batch_size 16 --rollout_batch_size 1024 \
--max_epochs 1 --prompt_max_len 1024 --generate_max_len 1024 \
--zero_stage 3 --bf16 \
--actor_learning_rate 5e-7 --critic_learning_rate 9e-6 \
--init_kl_coef 0.01 --normalize_reward \
--gradient_checkpointing --packing_samples \
--vllm_enable_sleep --deepspeed_enable_sleep
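How the batch flags above fit together (a worked example, assuming all 8 GPUs train the actor):
# rollout_batch_size 1024: prompts generated per PPO step
# micro_rollout_batch_size 16: per-GPU generation chunk
# train_batch_size 128: global optimizer batch
# micro_train_batch_size 8 on 8 GPUs -> 128 / (8 * 8) = 2 gradient-accumulation steps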
GRPO Training (Group Relative Policy Optimization):
# Same command as PPO, but add:
--advantage_estimator group_norm
Step 1: Train the reward model:
deepspeed --module openrlhf.cli.train_rm \
--save_path ./output/llama3-8b-rm \
--save_steps -1 --logging_steps 1 \
--eval_steps -1 --train_batch_size 256 \
--micro_train_batch_size 1 --pretrain meta-llama/Meta-Llama-3-8B \
--bf16 --max_epochs 1 --max_len 8192 \
--zero_stage 3 --learning_rate 9e-6 \
--dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
--apply_chat_template --chosen_key chosen \
--rejected_key rejected --flash_attn --gradient_checkpointing
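Before launching, it can help to confirm the dataset actually carries the fields that --chosen_key and --rejected_key point at (a hedged one-liner; the split name is an assumption):
# Print the keys of one preference record
python3 -c "
from datasets import load_dataset
ds = load_dataset('OpenRLHF/preference_dataset_mixture2_and_safe_pku', split='train')
print(list(ds[0].keys()))
"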
Step 2: PPO training:
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8
ray job submit --address="http://127.0.0.1:8265" \
-- python3 -m openrlhf.cli.train_ppo_ray \
--ref_num_nodes 1 --ref_num_gpus_per_node 8 \
--reward_num_nodes 1 --reward_num_gpus_per_node 8 \
--critic_num_nodes 1 --critic_num_gpus_per_node 8 \
--actor_num_nodes 1 --actor_num_gpus_per_node 8 \
--vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
--colocate_all_models \
--pretrain OpenRLHF/Llama-3-8b-sft-mixture \
--reward_pretrain ./output/llama3-8b-rm \
--save_path ./output/llama3-8b-ppo \
--micro_train_batch_size 8 --train_batch_size 128 \
--micro_rollout_batch_size 16 --rollout_batch_size 1024 \
--max_epochs 1 --prompt_max_len 1024 --generate_max_len 1024 \
--zero_stage 3 --bf16 \
--actor_learning_rate 5e-7 --critic_learning_rate 9e-6 \
--init_kl_coef 0.01 --normalize_reward \
--vllm_enable_sleep --deepspeed_enable_sleep
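The submitted job can be followed with the stock Ray Jobs CLI (standard Ray commands, not OpenRLHF-specific; the submission ID is printed by ray job submit):
ray job list --address="http://127.0.0.1:8265"
ray job logs --address="http://127.0.0.1:8265" --follow <submission_id>
ray stop  # tear down the local cluster when finished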
GRPO as a memory-efficient alternative to PPO (no critic model, so no --critic_num_nodes / --critic_num_gpus_per_node allocation):
ray job submit --address="http://127.0.0.1:8265" \
-- python3 -m openrlhf.cli.train_ppo_ray \
--advantage_estimator group_norm \
--ref_num_nodes 1 --ref_num_gpus_per_node 8 \
--reward_num_nodes 1 --reward_num_gpus_per_node 8 \
--actor_num_nodes 1 --actor_num_gpus_per_node 8 \
--vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
--colocate_all_models \
--pretrain OpenRLHF/Llama-3-8b-sft-mixture \
--reward_pretrain OpenRLHF/Llama-3-8b-rm-700k \
--save_path ./output/llama3-8b-grpo \
--micro_train_batch_size 8 --train_batch_size 128 \
--micro_rollout_batch_size 16 --rollout_batch_size 1024 \
--max_epochs 1 --bf16 \
--actor_learning_rate 5e-7 \
--init_kl_coef 0.01 --use_kl_loss --kl_estimator k3 \
--normalize_reward --no_advantage_std_norm
Key GRPO parameters:
--advantage_estimator group_norm - Enables GRPO
--use_kl_loss - KL loss from the GRPO paper
--kl_estimator k3 - KL loss estimator (k2 ≈ k1)
--no_advantage_std_norm - Disables advantage std normalization
Simpler alternative without a reward model (DPO):
deepspeed --module openrlhf.cli.train_dpo \
--save_path ./output/llama3-8b-dpo \
--save_steps -1 --logging_steps 1 \
--eval_steps -1 --train_batch_size 256 \
--micro_train_batch_size 2 --pretrain meta-llama/Meta-Llama-3-8B \
--bf16 --max_epochs 1 --max_len 8192 \
--zero_stage 3 --learning_rate 5e-7 --beta 0.1 \
--dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
--apply_chat_template --chosen_key chosen \
--rejected_key rejected --flash_attn --gradient_checkpointing
Use OpenRLHF when:
Algorithm selection:
Use alternatives instead:
Issue: GPU OOM with large models
Disable model colocation:
# Remove --colocate_all_models flag
# Allocate separate GPUs for each model
--actor_num_gpus_per_node 8 \
--critic_num_gpus_per_node 8 \
--reward_num_gpus_per_node 8 \
--ref_num_gpus_per_node 8
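If separate allocation still runs out of memory, the knobs already used in the examples above can be turned down (illustrative values, same flags as before):
# Halve the per-GPU micro-batches and give vLLM less of each GPU
--micro_train_batch_size 4 \
--micro_rollout_batch_size 8 \
--vllm_gpu_memory_utilization 0.3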
Issue: DeepSpeed GPU index out of range
Set this environment variable so Ray does not set CUDA_VISIBLE_DEVICES for its workers (DeepSpeed then sees all GPUs and indexes them itself):
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1
Issue: Training instability
Use the Hybrid Engine instead of asynchronous training:
--colocate_all_models \
--vllm_enable_sleep \
--deepspeed_enable_sleep
Adjust KL coefficient:
--init_kl_coef 0.05 # Increase from 0.01
Issue: Slow generation during PPO
Enable vLLM acceleration:
--vllm_num_engines 4 \
--vllm_tensor_parallel_size 2 \
--vllm_gpu_memory_utilization 0.5
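As a quick check, the engine count times the tensor-parallel size should match the GPUs serving generation (worked through for the 8-GPU examples above):
# 4 engines x TP 2 = 8 GPUs, matching --actor_num_gpus_per_node 8;
# with --colocate_all_models these GPUs are shared with training,
# hence the 0.5 cap on vLLM memory utilization.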
Hybrid Engine GPU sharing: See references/hybrid-engine.md for vLLM sleep mode, DeepSpeed sleep mode, and optimal node allocation.
Algorithm comparison: See references/algorithm-comparison.md for PPO vs GRPO vs RLOO vs REINFORCE++ benchmarks and hyperparameters.
Multi-node setup: See references/multi-node-training.md for Ray cluster configuration and fault tolerance.
Custom reward functions: See references/custom-rewards.md for reinforced fine-tuning and agent RLHF.