Important prerequisite
Installing AI Skills requires a working proxy/VPN connection with TUN mode enabled; this directly determines whether the installation completes successfully, so make sure it is on before you start. See the full installation guide →
torchforge-rl-training by orchestra-research/ai-research-skills
npx skills add https://github.com/orchestra-research/ai-research-skills --skill torchforge-rl-training
torchforge is Meta's PyTorch-native RL library that separates infrastructure concerns from algorithm concerns. It enables rapid RL research by letting you focus on algorithms while handling distributed training, inference, and weight sync automatically.
Choose torchforge when you need:
Consider alternatives when:
┌─────────────────────────────────────────────────────────┐
│ Application Layer (Your Code) │
│ - Define reward models, loss functions, sampling │
└─────────────────────┬───────────────────────────────────┘
│
┌─────────────────────▼───────────────────────────────────┐
│ Forge API Layer │
│ - Episode, Group dataclasses │
│ - Service interfaces (async/await) │
└─────────────────────┬───────────────────────────────────┘
│
┌─────────────────────▼───────────────────────────────────┐
│ Distributed Services (Monarch) │
│ ├── Trainer (TorchTitan FSDP) │
│ ├── Generator (vLLM inference) │
│ ├── Reference Model (frozen KL baseline) │
│ └── Reward Actors (compute rewards) │
└─────────────────────────────────────────────────────────┘
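How these layers interact is easiest to see in code. Below is a minimal sketch of collecting a GRPO-style group of responses for one prompt; the service handles and their method names (generate, evaluate) are illustrative assumptions, not the exact Forge service API.

import asyncio

# Hypothetical service handles: `generator` fronts vLLM, `reward_actor`
# wraps a reward function. Method names are assumptions for illustration.
async def collect_group(generator, reward_actor, prompt: str, target: str, n_samples: int = 8):
    # Request several completions for the same prompt concurrently.
    completions = await asyncio.gather(
        *[generator.generate(prompt) for _ in range(n_samples)]
    )
    # Score each completion against the target with a reward actor.
    rewards = await asyncio.gather(
        *[reward_actor.evaluate(prompt, c.text, target) for c in completions]
    )
    return completions, rewards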
# Create environment
conda create -n forge python=3.12
conda activate forge
# Install (handles PyTorch nightly + dependencies)
./scripts/install.sh
# Verify
python -c "import torch, forge, vllm; print('OK')"
# AMD ROCm install (alternative to install.sh)
./scripts/install_rocm.sh

# Run the supervised fine-tuning (SFT) example
python -m apps.sft.main --config apps/sft/llama3_8b.yaml

# Run the GRPO example
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
Use this workflow for training reasoning models with group-relative advantages.
# config/grpo_math.yaml
model: "Qwen/Qwen2.5-7B-Instruct"

dataset:
  path: "openai/gsm8k"
  split: "train"
  streaming: true

training:
  batch_size: 4
  learning_rate: 1e-6
  seq_len: 4096
  dtype: bfloat16
  gradient_accumulation_steps: 4

grpo:
  n_samples: 8      # Responses per prompt
  clip_low: 0.2
  clip_high: 0.28
  beta: 0.1         # KL penalty coefficient
  temperature: 0.7

services:
  generator:
    procs: 1
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 1
    num_replicas: 1
    with_gpus: true
  ref_model:
    procs: 1
    num_replicas: 1
    with_gpus: true
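The n_samples group is what makes the advantages "group-relative": each response's reward is normalized against the other responses to the same prompt. A minimal sketch of that standard GRPO normalization (not torchforge's internal implementation):

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # rewards: shape (n_samples,), one scalar reward per response to one prompt.
    # Responses that beat their siblings get positive advantage, and vice versa.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 responses scored 0/1 by a math reward.
advantages = group_relative_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]))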
# rewards.py
import re

# Built-in reward functions are in forge.data.rewards
from forge.data.rewards import MathReward, ThinkingReward

# Or define your own reward function
class CustomMathReward:
    def __call__(self, prompt: str, response: str, target: str) -> float:
        # Extract the \boxed{...} answer from the response
        match = re.search(r'\\boxed{([^}]+)}', response)
        if not match:
            return 0.0
        answer = match.group(1).strip()
        return 1.0 if answer == target else 0.0
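A quick sanity check of the custom reward (the prompt and responses below are made up for illustration):

reward = CustomMathReward()

# Correct \boxed{} answer -> 1.0
print(reward("What is 6 * 7?", r"... so the answer is \boxed{42}", "42"))

# No \boxed{} present -> 0.0
print(reward("What is 6 * 7?", "the answer is 42", "42"))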
python -m apps.grpo.main --config config/grpo_math.yaml
Use this workflow to implement new RL algorithms.
# src/forge/losses/custom_loss.py
import torch
import torch.nn as nn


class CustomLoss(nn.Module):
    def __init__(self, clip_range: float = 0.2, beta: float = 0.1):
        super().__init__()
        self.clip_range = clip_range
        self.beta = beta

    def forward(
        self,
        logprobs: torch.Tensor,
        ref_logprobs: torch.Tensor,
        advantages: torch.Tensor,
        padding_mask: torch.Tensor,
    ) -> torch.Tensor:
        # Compute importance ratio
        ratio = torch.exp(logprobs - ref_logprobs)
        # Clipped policy gradient
        clipped_ratio = torch.clamp(
            ratio,
            1 - self.clip_range,
            1 + self.clip_range,
        )
        pg_loss = -torch.min(ratio * advantages, clipped_ratio * advantages)
        # KL penalty
        kl = ref_logprobs - logprobs
        # Apply mask and aggregate
        masked_loss = (pg_loss + self.beta * kl) * padding_mask
        loss = masked_loss.sum() / padding_mask.sum()
        return loss
# apps/custom/main.py
from forge.losses.custom_loss import CustomLoss

loss_fn = CustomLoss(clip_range=0.2, beta=0.1)

# In training loop
loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)
Use this workflow for scaling to multiple GPUs or nodes.
# config/distributed.yaml
model: "meta-llama/Meta-Llama-3.1-8B-Instruct"

parallelism:
  tensor_parallel_degree: 2     # Split model across GPUs
  pipeline_parallel_degree: 1
  data_parallel_shard_degree: 2

services:
  generator:
    procs: 2        # 2 processes for TP=2
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 2
    num_replicas: 1
    with_gpus: true
# Submit job
sbatch --nodes=2 --gpus-per-node=8 run_grpo.sh
# 8 GPU setup
python -m apps.grpo.main \
--config config/distributed.yaml \
--trainer.procs 4 \
--generator.procs 4
torchforge uses dictionary-based batches for training:
# inputs: list of dicts with torch.Tensor values
inputs = [{"tokens": torch.Tensor}]

# targets: list of dicts with training signals
targets = [{
    "response": torch.Tensor,
    "ref_logprobs": torch.Tensor,
    "advantages": torch.Tensor,
    "padding_mask": torch.Tensor,
}]

# train_step returns loss as float
loss = trainer.train_step(inputs, targets)
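A concrete toy construction of one batch element, assuming per-token tensors of shape (1, seq_len) (the shapes here are an assumption for illustration, not a documented contract):

import torch

seq_len = 16  # toy sequence length

inputs = [{"tokens": torch.randint(0, 32_000, (1, seq_len))}]

targets = [{
    "response": torch.randint(0, 32_000, (1, seq_len)),
    "ref_logprobs": torch.zeros(1, seq_len),       # from the frozen reference model
    "advantages": torch.full((1, seq_len), 0.5),   # group-relative advantage per token
    "padding_mask": torch.ones(1, seq_len),        # 1 = real token, 0 = padding
}]

loss = trainer.train_step(inputs, targets)  # trainer comes from your app setup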
Generated output from vLLM:
from dataclasses import dataclass

@dataclass
class Completion:
    text: str               # Generated text
    token_ids: list[int]    # Token IDs
    logprobs: list[float]   # Log probabilities
    metadata: dict          # Custom metadata
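The metadata dict is free-form, which makes it a convenient place to stash per-completion rewards. An illustrative helper (reward_fn is any callable like CustomMathReward above):

def score_completions(completions: list[Completion], prompt: str, target: str, reward_fn) -> list[float]:
    # Score each completion against the target and record the result in its metadata.
    rewards = []
    for c in completions:
        r = reward_fn(prompt, c.text, target)
        c.metadata["reward"] = r
        rewards.append(r)
    return rewards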
Loss functions are in the forge.losses module:
from forge.losses import SimpleGRPOLoss, ReinforceLoss
# SimpleGRPOLoss for GRPO training
loss_fn = SimpleGRPOLoss(beta=0.1)
# Forward pass
loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)
from forge.losses.reinforce_loss import ReinforceLoss
# With optional importance ratio clipping
loss_fn = ReinforceLoss(clip_ratio=0.2)
Symptoms : "Insufficient GPU resources" error
Solutions :
# Reduce service requirements
services:
generator:
procs: 1
with_gpus: true
trainer:
procs: 1
with_gpus: true
# Remove ref_model (uses generator weights)
Or use CPU for reference model:
ref_model:
  with_gpus: false
Symptoms: CUDA OOM in vLLM
Solutions:
# Reduce batch size
grpo:
  n_samples: 4    # Reduce from 8

# Or reduce sequence length
training:
  seq_len: 2048
Symptoms: Long pauses between training and generation
Solutions:
# Enable RDMA (if available)
export TORCHSTORE_USE_RDMA=1
# Or reduce sync frequency
training:
  sync_interval: 10   # Sync every 10 steps
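Conceptually, a lower sync frequency just means the trainer pushes fresh weights to the generator every sync_interval steps instead of after every step. A sketch of that loop, where push_weights is a hypothetical stand-in for torchforge's actual weight-sync mechanism:

def train_with_periodic_sync(trainer, generator, batches, push_weights, sync_interval: int = 10):
    for step, (inputs, targets) in enumerate(batches):
        loss = trainer.train_step(inputs, targets)
        # Pay the weight-transfer cost only every sync_interval steps;
        # in between, the generator samples from slightly stale weights.
        if step % sync_interval == 0:
            push_weights(trainer, generator)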
Symptoms: Entropy drops to zero, reward stops improving
Solutions:
# Increase KL penalty
grpo:
  beta: 0.2    # Increase from 0.1

# Or add entropy bonus
training:
  entropy_coef: 0.01
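The entropy bonus works by adding coef × entropy to the objective (equivalently, subtracting it from the loss) so the policy is rewarded for staying stochastic. A sketch of the underlying computation, assuming full next-token logits are available inside the loss:

import torch

def entropy_bonus(logits: torch.Tensor, padding_mask: torch.Tensor, coef: float = 0.01) -> torch.Tensor:
    # Token-level entropy of the policy distribution, averaged over real tokens.
    logp = torch.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)              # (batch, seq_len)
    mean_entropy = (entropy * padding_mask).sum() / padding_mask.sum()
    # Add this (negative) term to the loss so higher entropy is encouraged.
    return -coef * mean_entropy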
Weekly Installs: 64
Repository: https://github.com/orchestra-research/ai-research-skills
GitHub Stars: 5.5K
First Seen: Feb 7, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on: codex (55), opencode (55), cursor (55), gemini-cli (54), github-copilot (53), claude-code (52)