Important prerequisite
Installing AI Skills requires a working proxy/VPN connection with TUN mode enabled; this directly determines whether the installation completes successfully, so make sure it is on before you start. See the full installation guide →
torchforge-rl-training by orchestra-research/ai-research-skills
npx skills add https://github.com/orchestra-research/ai-research-skills --skill torchforge-rl-training
torchforge is Meta's PyTorch-native RL library that separates infrastructure concerns from algorithm concerns. It enables rapid RL research by letting you focus on algorithms while handling distributed training, inference, and weight sync automatically.
Choose torchforge when you need:
Consider alternatives when:
┌─────────────────────────────────────────────────────────┐
│ Application Layer (Your Code) │
│ - Define reward models, loss functions, sampling │
└─────────────────────┬───────────────────────────────────┘
│
┌─────────────────────▼───────────────────────────────────┐
│ Forge API Layer │
│ - Episode, Group dataclasses │
│ - Service interfaces (async/await) │
└─────────────────────┬───────────────────────────────────┘
│
┌─────────────────────▼───────────────────────────────────┐
│ Distributed Services (Monarch) │
│ ├── Trainer (TorchTitan FSDP) │
│ ├── Generator (vLLM inference) │
│ ├── Reference Model (frozen KL baseline) │
│ └── Reward Actors (compute rewards) │
└─────────────────────────────────────────────────────────┘
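How these layers interact is easiest to see in code. Below is a minimal sketch of collecting a GRPO-style group of responses for one prompt; the service handles and their method names (generate, evaluate) are illustrative assumptions, not the exact Forge service API.

import asyncio

# Hypothetical service handles: `generator` fronts vLLM, `reward_actor`
# wraps a reward function. Method names are assumptions for illustration.
async def collect_group(generator, reward_actor, prompt: str, target: str, n_samples: int = 8):
    # Request several completions for the same prompt concurrently.
    completions = await asyncio.gather(
        *[generator.generate(prompt) for _ in range(n_samples)]
    )
    # Score each completion against the target with a reward actor.
    rewards = await asyncio.gather(
        *[reward_actor.evaluate(prompt, c.text, target) for c in completions]
    )
    return completions, rewards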
# Create environment
conda create -n forge python=3.12
conda activate forge
# Install (handles PyTorch nightly + dependencies)
./scripts/install.sh
# Verify
python -c "import torch, forge, vllm; print('OK')"
# AMD ROCm install (alternative to install.sh)
./scripts/install_rocm.sh

# Run the supervised fine-tuning (SFT) example
python -m apps.sft.main --config apps/sft/llama3_8b.yaml

# Run the GRPO example
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
Use this workflow for training reasoning models with group-relative advantages.
# config/grpo_math.yaml
model: "Qwen/Qwen2.5-7B-Instruct"

dataset:
  path: "openai/gsm8k"
  split: "train"
  streaming: true

training:
  batch_size: 4
  learning_rate: 1e-6
  seq_len: 4096
  dtype: bfloat16
  gradient_accumulation_steps: 4

grpo:
  n_samples: 8      # Responses per prompt
  clip_low: 0.2
  clip_high: 0.28
  beta: 0.1         # KL penalty coefficient
  temperature: 0.7

services:
  generator:
    procs: 1
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 1
    num_replicas: 1
    with_gpus: true
  ref_model:
    procs: 1
    num_replicas: 1
    with_gpus: true
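The n_samples group is what makes the advantages "group-relative": each response's reward is normalized against the other responses to the same prompt. A minimal sketch of that standard GRPO normalization (not torchforge's internal implementation):

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # rewards: shape (n_samples,), one scalar reward per response to one prompt.
    # Responses that beat their siblings get positive advantage, and vice versa.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 responses scored 0/1 by a math reward.
advantages = group_relative_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]))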
# rewards.py
import re

# Built-in reward functions are in forge.data.rewards
from forge.data.rewards import MathReward, ThinkingReward

# Or define your own reward function
class CustomMathReward:
    def __call__(self, prompt: str, response: str, target: str) -> float:
        # Extract the \boxed{...} answer from the response
        match = re.search(r'\\boxed{([^}]+)}', response)
        if not match:
            return 0.0
        answer = match.group(1).strip()
        return 1.0 if answer == target else 0.0
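A quick sanity check of the custom reward (the prompt and responses below are made up for illustration):

reward = CustomMathReward()

# Correct \boxed{} answer -> 1.0
print(reward("What is 6 * 7?", r"... so the answer is \boxed{42}", "42"))

# No \boxed{} present -> 0.0
print(reward("What is 6 * 7?", "the answer is 42", "42"))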
python -m apps.grpo.main --config config/grpo_math.yaml
Use this workflow to implement new RL algorithms.
# src/forge/losses/custom_loss.py
import torch
import torch.nn as nn


class CustomLoss(nn.Module):
    def __init__(self, clip_range: float = 0.2, beta: float = 0.1):
        super().__init__()
        self.clip_range = clip_range
        self.beta = beta

    def forward(
        self,
        logprobs: torch.Tensor,
        ref_logprobs: torch.Tensor,
        advantages: torch.Tensor,
        padding_mask: torch.Tensor,
    ) -> torch.Tensor:
        # Compute importance ratio
        ratio = torch.exp(logprobs - ref_logprobs)
        # Clipped policy gradient
        clipped_ratio = torch.clamp(
            ratio,
            1 - self.clip_range,
            1 + self.clip_range,
        )
        pg_loss = -torch.min(ratio * advantages, clipped_ratio * advantages)
        # KL penalty
        kl = ref_logprobs - logprobs
        # Apply mask and aggregate
        masked_loss = (pg_loss + self.beta * kl) * padding_mask
        loss = masked_loss.sum() / padding_mask.sum()
        return loss
# apps/custom/main.py
from forge.losses.custom_loss import CustomLoss

loss_fn = CustomLoss(clip_range=0.2, beta=0.1)

# In training loop
loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)
Use this workflow for scaling to multiple GPUs or nodes.
# config/distributed.yaml
model: "meta-llama/Meta-Llama-3.1-8B-Instruct"

parallelism:
  tensor_parallel_degree: 2     # Split model across GPUs
  pipeline_parallel_degree: 1
  data_parallel_shard_degree: 2

services:
  generator:
    procs: 2        # 2 processes for TP=2
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 2
    num_replicas: 1
    with_gpus: true
# Submit job
sbatch --nodes=2 --gpus-per-node=8 run_grpo.sh
# 8 GPU setup
python -m apps.grpo.main \
--config config/distributed.yaml \
--trainer.procs 4 \
--generator.procs 4
torchforge uses dictionary-based batches for training:
# inputs: list of dicts with torch.Tensor values
inputs = [{"tokens": torch.Tensor}]

# targets: list of dicts with training signals
targets = [{
    "response": torch.Tensor,
    "ref_logprobs": torch.Tensor,
    "advantages": torch.Tensor,
    "padding_mask": torch.Tensor,
}]

# train_step returns loss as float
loss = trainer.train_step(inputs, targets)
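A concrete toy construction of one batch element, assuming per-token tensors of shape (1, seq_len) (the shapes here are an assumption for illustration, not a documented contract):

import torch

seq_len = 16  # toy sequence length

inputs = [{"tokens": torch.randint(0, 32_000, (1, seq_len))}]

targets = [{
    "response": torch.randint(0, 32_000, (1, seq_len)),
    "ref_logprobs": torch.zeros(1, seq_len),       # from the frozen reference model
    "advantages": torch.full((1, seq_len), 0.5),   # group-relative advantage per token
    "padding_mask": torch.ones(1, seq_len),        # 1 = real token, 0 = padding
}]

loss = trainer.train_step(inputs, targets)  # trainer comes from your app setup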
Generated output from vLLM:
from dataclasses import dataclass

@dataclass
class Completion:
    text: str               # Generated text
    token_ids: list[int]    # Token IDs
    logprobs: list[float]   # Log probabilities
    metadata: dict          # Custom metadata
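The metadata dict is free-form, which makes it a convenient place to stash per-completion rewards. An illustrative helper (reward_fn is any callable like CustomMathReward above):

def score_completions(completions: list[Completion], prompt: str, target: str, reward_fn) -> list[float]:
    # Score each completion against the target and record the result in its metadata.
    rewards = []
    for c in completions:
        r = reward_fn(prompt, c.text, target)
        c.metadata["reward"] = r
        rewards.append(r)
    return rewards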
Loss functions are in the forge.losses module:
from forge.losses import SimpleGRPOLoss, ReinforceLoss
# SimpleGRPOLoss for GRPO training
loss_fn = SimpleGRPOLoss(beta=0.1)
# Forward pass
loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)
from forge.losses.reinforce_loss import ReinforceLoss
# With optional importance ratio clipping
loss_fn = ReinforceLoss(clip_ratio=0.2)
Symptoms : "Insufficient GPU resources" error
Solutions :
# Reduce service requirements
services:
generator:
procs: 1
with_gpus: true
trainer:
procs: 1
with_gpus: true
# Remove ref_model (uses generator weights)
Or use CPU for reference model:
ref_model:
  with_gpus: false
Symptoms: CUDA OOM in vLLM
Solutions:
# Reduce batch size
grpo:
  n_samples: 4    # Reduce from 8

# Or reduce sequence length
training:
  seq_len: 2048
Symptoms: Long pauses between training and generation
Solutions:
# Enable RDMA (if available)
export TORCHSTORE_USE_RDMA=1
# Or reduce sync frequency
training:
  sync_interval: 10   # Sync every 10 steps
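Conceptually, a lower sync frequency just means the trainer pushes fresh weights to the generator every sync_interval steps instead of after every step. A sketch of that loop, where push_weights is a hypothetical stand-in for torchforge's actual weight-sync mechanism:

def train_with_periodic_sync(trainer, generator, batches, push_weights, sync_interval: int = 10):
    for step, (inputs, targets) in enumerate(batches):
        loss = trainer.train_step(inputs, targets)
        # Pay the weight-transfer cost only every sync_interval steps;
        # in between, the generator samples from slightly stale weights.
        if step % sync_interval == 0:
            push_weights(trainer, generator)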
Symptoms: Entropy drops to zero, reward stops improving
Solutions:
# Increase KL penalty
grpo:
  beta: 0.2    # Increase from 0.1

# Or add entropy bonus
training:
  entropy_coef: 0.01
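The entropy bonus works by adding coef × entropy to the objective (equivalently, subtracting it from the loss) so the policy is rewarded for staying stochastic. A sketch of the underlying computation, assuming full next-token logits are available inside the loss:

import torch

def entropy_bonus(logits: torch.Tensor, padding_mask: torch.Tensor, coef: float = 0.01) -> torch.Tensor:
    # Token-level entropy of the policy distribution, averaged over real tokens.
    logp = torch.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)              # (batch, seq_len)
    mean_entropy = (entropy * padding_mask).sum() / padding_mask.sum()
    # Add this (negative) term to the loss so higher entropy is encouraged.
    return -coef * mean_entropy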
Weekly Installs: 64
Repository: https://github.com/orchestra-research/ai-research-skills
GitHub Stars: 5.5K
First Seen: Feb 7, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on: codex (55), opencode (55), cursor (55), gemini-cli (54), github-copilot (53), claude-code (52)