⚠️

Important Prerequisite

Installing AI Skills requires a working proxy (VPN) connection with TUN mode enabled. This is critical: it directly determines whether the installation can complete.

miles-rl-training: an enterprise-grade reinforcement learning framework supporting FP8/INT4 training of large-scale MoE models

miles-rl-training by orchestra-research/ai-research-skills

65 weekly installs

5,500 GitHub Stars

Install command

npx skills add https://github.com/orchestra-research/ai-research-skills --skill miles-rl-training

Tags: AI/Machine Learning, Reinforcement Learning, High-Performance Computing

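🇨🇳 Chinese Introduction

miles: Enterprise-Grade RL for Large-Scale Model Training

miles is a high-performance, enterprise-ready reinforcement learning framework built for large-scale model post-training. As a production fork of slime, it addresses key challenges in mixture-of-experts (MoE) training stability, low-precision training, and train-inference alignment.

When to Use miles

Choose miles when you need:

Training of 1TB+ MoE models (such as DeepSeek V3, Qwen3-MoE)
FP8 or INT4 quantization-aware training
Bit-wise identical train-inference alignment
Speculative RL for maximum throughput
Production stability with enterprise support

Consider alternatives when:

You want the research-grade original → use slime
You need flexible backend swapping → use verl
You want PyTorch-native abstractions → use torchforge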

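Key Features

Low-Precision Training

Unified FP8: End-to-end FP8 for both inference and training
INT4 QAT: 1TB models on single-machine VRAM (e.g., H200)
Rollout Routing Replay: Bit-wise expert alignment for MoE models

Performance Optimizations

Speculative RL: 25%+ rollout speedup with online SFT draft models
Zero-Copy Weight Sync: CUDA IPC zero-copy mapping
Partial Rollout: Recycle unfinished trajectories

Train-Inference Alignment

TIS/MIS: Truncated/Masked Importance Sampling for off-policy correction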


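Workflow 1: Large MoE Training

Use this workflow for training large MoE models such as DeepSeek V3 or Qwen3-MoE.

Prerequisites Checklist

H100/H200 GPUs with FP8 support
MoE model (DeepSeek V3, Qwen3-MoE)
Docker environment with miles

Step 1: Environment Setup

# FP8 block scaling (recommended for stability)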
export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
export CUDA_DEVICE_MAX_CONNECTIONS=1

Step 2: Configure Training

python train.py \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --hf-checkpoint /path/to/deepseek-v3 \
    --advantage-estimator grpo \
    --tensor-model-parallel-size 8 \
    --expert-model-parallel-size 4 \
    --prompt-data /path/to/data.jsonl \
    --num-rollout 3000

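Verification Checklist

Model loads without errors
Routing decisions are consistent
No NaN/Inf in loss values

Workflow 2: Speculative RL Training

Use this workflow for maximum rollout throughput with EAGLE speculative decoding.

How Speculative RL Works

A small draft model generates candidate tokens
The target model verifies them in parallel
The draft model is updated via online SFT to track the policy

Step 1: Enable Speculative Decoding

miles supports EAGLE speculative decoding via SGLang: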

python train.py \
    --actor-num-gpus-per-node 8 \
    --hf-checkpoint /path/to/target-model \
    --sglang-speculative-algorithm EAGLE \
    --sglang-speculative-num-steps 3 \
    --sglang-speculative-eagle-topk 1 \
    --sglang-speculative-num-draft-tokens 4 \
    --sglang-speculative-draft-model-path /path/to/draft-model \
    --advantage-estimator grpo \
    --prompt-data /path/to/data.jsonl

Step 2: Enable Online MTP Training (Optional)

For online SFT of the draft model during training:

--mtp-num-layers 1 \
--enable-mtp-training \
--mtp-loss-scaling-factor 0.2

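Note: Online MTP training requires a torch dist checkpoint that contains MTP weights. Add --mtp-num-layers 1 when converting the checkpoint from HuggingFace.

Expected Speedup

Standard rollout: Baseline
Speculative RL: 25-40% faster rollout
With partial rollout: Additional 10-15% throughput

Configuration Reference

miles inherits all slime arguments. See the slime API Reference for the complete list.

Cluster Resources (from slime)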

--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--rollout-num-gpus-per-engine 2
--colocate

Megatron Parallelism (from slime)

--tensor-model-parallel-size 8
--pipeline-model-parallel-size 2
--expert-model-parallel-size 4    # MoE expert parallelism

Speculative Decoding (miles-specific)

--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
--sglang-enable-draft-weights-cpu-backup
--sglang-speculative-draft-model-path /your/draft/model/path

Online MTP Training (miles-specific)

--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2

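Key Features (Conceptual)

The following features are documented in miles, but the specific CLI flags may vary. Consult the miles repository for the latest configuration.

Unified FP8 Pipeline

End-to-end FP8 sampling and training, which eliminates the quantization discrepancy that causes RL collapse in MoE models.

Rollout Routing Replay

Records expert routing decisions during SGLang inference and replays them during Megatron training for bit-wise expert alignment.

How R3 Works:

During SGLang inference, expert routing decisions are recorded
Routing decisions are stored in sample.rollout_routed_experts
During Megatron training, routing is replayed instead of recomputed
This ensures identical expert selection between training and inference

INT4 Quantization-Aware Training

Enables single-machine deployment of 1TB+ models (e.g., on H200).

Memory Savings with INT4:

Model Size	BF16 VRAM	INT4 VRAM	Reduction
70B	140GB	45GB	3.1x
235B	470GB	150GB	3.1x
671B	1.3TB	420GB	3.1x

Train-Inference Alignment

miles achieves "exactly 0 KL divergence" between training and inference through:

Flash Attention 3
DeepGEMM
Batch-invariant kernels from Thinking Machines Lab
torch.compile integration

Sample Data Structure

miles uses the same Sample dataclass as slime, with the rollout_routed_experts field for MoE routing replay: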

from dataclasses import dataclass

# Status is defined in slime and is not shown here.
@dataclass
class Sample:
    prompt: str | list[dict]
    tokens: list[int]
    response: str
    reward: float | dict
    loss_mask: list[int]
    status: Status
    metadata: dict
    rollout_log_probs: list[float]
    rollout_routed_experts: list[list[int]]  # MoE routing for R3

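See the slime API Reference for the complete Sample definition.

Common Issues and Solutions

Issue: FP8 Training Collapse

Symptoms: Loss explodes, NaN values

Solutions:

Use block scaling: export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
Reduce learning rate: --lr 5e-7
Ensure MoE routing is consistent between training and inference

Issue: Speculative Draft Drift

Symptoms: Acceptance rate drops over time

Solutions:

Enable online MTP training to keep the draft model aligned
Reduce speculative steps: --sglang-speculative-num-steps 2
Use CPU backup: --sglang-enable-draft-weights-cpu-backup

Issue: Train-Inference Mismatch

Symptoms: Policy divergence, reward collapse

Solutions:

Use TIS for off-policy correction: --use-tis --tis-threshold 0.9
Verify that log probs match between SGLang and Megatron
Enable R3 for MoE models

Supported Models

Family	Models	MoE Support
DeepSeek	R1, V3, V3.2	Full
Qwen	2, 2.5, 3 (including MoE)	Full
Llama	3, 3.1, 3.3, 4	Dense only
Gemma	2, 3, 3N	Dense only
GLM	4.5, 4.6, 4.7	Dense only
MiniMax	M2, M2.1	Full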

🇺🇸English

miles: Enterprise-Grade RL for Large-Scale Model Training

miles is a high-performance, enterprise-ready RL framework optimized for large-scale model post-training. Built as a production fork of slime, it addresses critical challenges in MoE training stability, low-precision training, and train-inference alignment.

When to Use miles

Choose miles when you need:

Training of 1TB+ MoE models (DeepSeek V3, Qwen3-MoE)
FP8 or INT4 quantization-aware training
Bit-wise identical train-inference alignment
Speculative RL for maximum throughput
Production stability with enterprise support

Consider alternatives when:

You want the research-grade original → use slime
You need flexible backend swapping → use verl
You want PyTorch-native abstractions → use torchforge

Key Features

Low-Precision Training

Unified FP8: End-to-end FP8 for both inference and training
INT4 QAT: 1TB models on single-machine VRAM (H200)
Rollout Routing Replay (R3): Bit-wise expert alignment for MoE

Performance Optimizations

Speculative RL: 25%+ rollout speedup with online SFT draft models
Zero-Copy Weight Sync: CUDA IPC zero-copy mapping
Partial Rollout: Recycle half-finished trajectories

Train-Inference Alignment

TIS/MIS: Truncated/Masked Importance Sampling for off-policy correction
Kernel-level optimization: FlashAttention-3, DeepGEMM integration
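
To make the TIS/MIS item concrete, here is a minimal sketch of how truncated and masked importance weights are commonly computed per token. The function names, the cap of 2.0, and the [0.5, 2.0] mask range are illustrative assumptions, not the miles implementation or its defaults.

```python
import math

def importance_ratios(train_log_probs, rollout_log_probs):
    """Per-token ratio pi_train / pi_rollout computed from log-probabilities."""
    return [math.exp(t - r) for t, r in zip(train_log_probs, rollout_log_probs)]

def truncated_is(ratios, cap=2.0):
    """TIS: clip each ratio from above so no token gets an outsized weight."""
    return [min(r, cap) for r in ratios]

def masked_is(ratios, low=0.5, high=2.0):
    """MIS: keep in-range ratios unchanged and zero out (mask) the rest."""
    return [r if low <= r <= high else 0.0 for r in ratios]

# Hypothetical log-probs for three response tokens.
train_lp = [-1.0, -2.0, -0.5]
rollout_lp = [-1.0, -3.0, -2.5]

ratios = importance_ratios(train_lp, rollout_lp)
tis_weights = truncated_is(ratios)   # large ratios are capped at 2.0
mis_weights = masked_is(ratios)      # large ratios are dropped entirely
print(tis_weights, mis_weights)
```

Truncation still lets a mismatched token contribute a bounded gradient, while masking removes it from the loss altogether; which is preferable depends on how severe the train-inference mismatch is.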

Installation

# Recommended: Docker
docker pull radixark/miles:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  -it radixark/miles:latest /bin/bash

# From source
git clone https://github.com/radixark/miles.git
cd miles
pip install -r requirements.txt
pip install -e .

Quick Start

miles inherits slime's configuration system. Basic training:

python train.py \
    --advantage-estimator grpo \
    --model-name qwen3-30b-a3b \
    --hf-checkpoint /path/to/qwen3-30b-a3b-hf \
    --rollout-batch-size 512 \
    --n-samples-per-prompt 8

Workflow 1: Large MoE Training

Use this workflow for training large MoE models like DeepSeek V3 or Qwen3-MoE.

Prerequisites Checklist

H100/H200 GPUs with FP8 support
MoE model (DeepSeek V3, Qwen3-MoE)
Docker environment with miles

Step 1: Environment Setup

# FP8 block scaling (recommended for stability)
export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
export CUDA_DEVICE_MAX_CONNECTIONS=1

Step 2: Configure Training

python train.py \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --hf-checkpoint /path/to/deepseek-v3 \
    --advantage-estimator grpo \
    --tensor-model-parallel-size 8 \
    --expert-model-parallel-size 4 \
    --prompt-data /path/to/data.jsonl \
    --num-rollout 3000

Verification Checklist

Model loads without errors
Routing decisions are consistent
No NaN/Inf in loss values
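
The last checklist item can be automated with a small helper. This is a sketch that assumes the per-step loss values have already been collected into a plain Python list; how they are extracted from the training logs is left open, and the helper name is hypothetical.

```python
import math

def first_non_finite(losses):
    """Return the index of the first NaN/Inf loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None

healthy = [2.31, 2.05, 1.98]
broken = [2.31, float("nan"), 1.98]

print(first_non_finite(healthy))  # None
print(first_non_finite(broken))   # 1
```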

Workflow 2: Speculative RL Training

Use this workflow for maximum rollout throughput with EAGLE speculative decoding.

How Speculative RL Works

Small draft model generates candidate tokens
Target model verifies in parallel
Draft model updated via online SFT to track policy
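
The first two steps above can be sketched with toy stand-ins for the models. This is a simplified greedy draft-and-verify loop, not the EAGLE algorithm or SGLang's implementation; `draft_next` and `target_next` are hypothetical callables that return the next token for a given prefix, and the online SFT update of the draft model is not shown.

```python
def speculative_step(prefix, draft_next, target_next, num_draft_tokens=4):
    """One greedy draft-and-verify step; returns (new_tokens, num_accepted)."""
    # 1. The draft model proposes a short continuation, one token at a time.
    drafted = []
    for _ in range(num_draft_tokens):
        drafted.append(draft_next(prefix + drafted))

    # 2. The target model checks every drafted position. A real engine scores
    #    all positions in one batched forward pass; this loop is sequential.
    new_tokens = []
    for i, token in enumerate(drafted):
        expected = target_next(prefix + drafted[:i])
        if token != expected:
            # First mismatch: keep the target's token and stop.
            new_tokens.append(expected)
            return new_tokens, i
        new_tokens.append(token)

    # All drafts accepted: the target also contributes one bonus token.
    new_tokens.append(target_next(prefix + drafted))
    return new_tokens, len(drafted)

# Toy "models": the target counts upward; the draft errs after a multiple of 3.
def target_next(tokens):
    return tokens[-1] + 1

def draft_next(tokens):
    return tokens[-1] + (2 if tokens[-1] % 3 == 0 else 1)

tokens, accepted = speculative_step([1], draft_next, target_next)
print(tokens, accepted)  # [2, 3, 4] 2
```

The emitted tokens are exactly what the target model would have produced on its own; the draft model only changes how many target passes are needed to produce them.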

Step 1: Enable Speculative Decoding

miles supports EAGLE speculative decoding via SGLang:

python train.py \
    --actor-num-gpus-per-node 8 \
    --hf-checkpoint /path/to/target-model \
    --sglang-speculative-algorithm EAGLE \
    --sglang-speculative-num-steps 3 \
    --sglang-speculative-eagle-topk 1 \
    --sglang-speculative-num-draft-tokens 4 \
    --sglang-speculative-draft-model-path /path/to/draft-model \
    --advantage-estimator grpo \
    --prompt-data /path/to/data.jsonl

Step 2: Enable Online MTP Training (Optional)

For online SFT of the draft model during training, append these flags to the training command:

--mtp-num-layers 1 \
--enable-mtp-training \
--mtp-loss-scaling-factor 0.2

Note: Online MTP training requires a torch dist checkpoint with MTP weights. Add --mtp-num-layers 1 during checkpoint conversion from HuggingFace.

Expected Speedup

Standard rollout: Baseline
Speculative RL: 25-40% faster rollout
With partial rollout: Additional 10-15% throughput

Configuration Reference

miles inherits all slime arguments. See slime API Reference for the complete list.

Cluster Resources (from slime)

--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--rollout-num-gpus-per-engine 2
--colocate

Megatron Parallelism (from slime)

--tensor-model-parallel-size 8
--pipeline-model-parallel-size 2
--expert-model-parallel-size 4    # MoE expert parallelism
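
Before launching, it can help to sanity-check how these sizes fit together. The sketch below assumes the usual conventions: rollout GPUs are split evenly across inference engines, and the training world size must be a multiple of tensor-parallel size times pipeline-parallel size. Expert and context parallelism add further constraints that are not checked here, and the helper names are hypothetical.

```python
def rollout_engine_count(rollout_num_gpus, gpus_per_engine):
    """Number of inference engines if GPUs are split evenly between them."""
    if rollout_num_gpus % gpus_per_engine != 0:
        raise ValueError("rollout GPUs must divide evenly across engines")
    return rollout_num_gpus // gpus_per_engine

def data_parallel_size(num_nodes, gpus_per_node, tp, pp):
    """Data-parallel replicas left after tensor and pipeline parallelism."""
    world_size = num_nodes * gpus_per_node
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"{world_size} GPUs cannot host TP={tp} x PP={pp} "
            f"({model_parallel} GPUs per replica)"
        )
    return world_size // model_parallel

engines = rollout_engine_count(8, 2)
dp = data_parallel_size(num_nodes=2, gpus_per_node=8, tp=8, pp=2)
print(engines, dp)  # 4 1
```

Under these assumptions, TP=8 with PP=2 needs 16 training GPUs per replica, so the two flag listings above read as independent references rather than one combined single-node configuration.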

Speculative Decoding (miles-specific)

--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
--sglang-enable-draft-weights-cpu-backup
--sglang-speculative-draft-model-path /your/draft/model/path

Online MTP Training (miles-specific)

--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2

Key Features (Conceptual)

The following features are documented in miles, but the specific CLI flags may vary. Consult the miles repository for the latest configuration.

Unified FP8 Pipeline

End-to-end FP8 sampling and training, which eliminates the quantization-induced discrepancy between rollout and training that can cause RL collapse in MoE models.

Rollout Routing Replay (R3)

Records expert routing decisions during SGLang inference and replays them during Megatron training for bit-wise expert alignment.

How R3 Works:

During SGLang inference, expert routing decisions are recorded
Routing decisions stored in sample.rollout_routed_experts
During Megatron training, routing is replayed instead of recomputed
Ensures identical expert selection between train and inference
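
The record-and-replay idea can be illustrated with a toy top-k router. This is a conceptual sketch only: `route` and `top_k_experts` are hypothetical helpers, and the real integration lives inside SGLang and Megatron.

```python
def top_k_experts(router_logits, k):
    """Indices of the k largest logits, highest first (ties: lower index first)."""
    order = sorted(range(len(router_logits)), key=lambda i: (-router_logits[i], i))
    return order[:k]

def route(router_logits_per_token, k, replay=None):
    """Pick experts per token, or reuse recorded choices when replay is given."""
    if replay is not None:
        return [list(choice) for choice in replay]
    return [top_k_experts(logits, k) for logits in router_logits_per_token]

# Inference side: record the routing decisions for two tokens, three experts.
inference_logits = [[0.10, 0.90, 0.80], [0.50, 0.49, 0.48]]
recorded = route(inference_logits, k=2)

# Training side: a tiny numerical difference flips a near-tie for token 2.
training_logits = [[0.10, 0.90, 0.80], [0.50, 0.48, 0.49]]
recomputed = route(training_logits, k=2)
replayed = route(training_logits, k=2, replay=recorded)

print(recorded)    # [[1, 2], [0, 1]]
print(recomputed)  # [[1, 2], [0, 2]]
print(replayed)    # [[1, 2], [0, 1]]
```

Recomputing the routing sends the second token to a different expert than the one that produced the rollout, whereas replaying the recorded choices keeps the two sides identical.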

INT4 Quantization-Aware Training

Enables single-machine deployment of 1TB+ models (e.g., on H200).

Memory Savings with INT4:

Model Size	BF16 VRAM	INT4 VRAM	Reduction
70B	140GB	45GB	3.1x
235B	470GB	150GB	3.1x
671B	1.3TB	420GB	3.1x
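
The table can be cross-checked with simple weight-only arithmetic. The sketch below assumes 1 GB = 1e9 bytes and ignores activations, optimizer state, and KV cache; the table rows are copied from above.

```python
def weight_memory_gb(num_params_billion, bits_per_param):
    """Weight-only footprint in GB (1 GB = 1e9 bytes), ignoring all overhead."""
    return num_params_billion * bits_per_param / 8

# (parameters in billions, BF16 GB, INT4 GB) copied from the table above
table = [(70, 140, 45), (235, 470, 150), (671, 1300, 420)]

for params_b, bf16_gb, int4_gb in table:
    ideal_bf16 = weight_memory_gb(params_b, 16)
    ideal_int4 = weight_memory_gb(params_b, 4)
    print(
        f"{params_b}B: BF16 ~{ideal_bf16:.0f} GB, ideal INT4 ~{ideal_int4:.0f} GB, "
        f"table ratio {bf16_gb / int4_gb:.1f}x"
    )
```

The BF16 column matches two bytes per parameter. The INT4 column sits above the ideal half byte per parameter, which is why the reduction is about 3.1x rather than 4x; the source does not say what accounts for the difference.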

Train-Inference Alignment

miles achieves "exactly 0 KL divergence" between training and inference through:

Flash Attention 3
DeepGEMM
Batch-invariant kernels from Thinking Machines Lab
torch.compile integration
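
One reason bit-wise agreement is hard to reach is that floating-point addition is not associative, so a kernel that changes its reduction order can return slightly different results for the same inputs. The snippet below shows this in plain Python; it illustrates the underlying numerical effect only and is not how the kernels listed above are implemented.

```python
# The same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)      # False
print(abs(left_to_right - right_to_left))  # tiny, but not zero
```

A batch-invariant kernel, as the term is generally used, keeps the reduction order fixed regardless of batch size, so the same sequence produces the same bits in both rollout and training.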

Sample Data Structure

miles uses the same Sample dataclass as slime with the rollout_routed_experts field for MoE routing replay:

from dataclasses import dataclass

# Status is defined in slime and is not shown here.
@dataclass
class Sample:
    prompt: str | list[dict]
    tokens: list[int]
    response: str
    reward: float | dict
    loss_mask: list[int]
    status: Status
    metadata: dict
    rollout_log_probs: list[float]
    rollout_routed_experts: list[list[int]]  # MoE routing for R3

See slime API Reference for the complete Sample definition.

Common Issues and Solutions

Issue: FP8 Training Collapse

Symptoms: Loss explodes, NaN values

Solutions:

Use block scaling: export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
Reduce learning rate: --lr 5e-7
Ensure MoE routing is consistent between train/inference

Issue: Speculative Draft Drift

Symptoms: Low acceptance rate over time

Solutions:

Enable online MTP training to keep draft model aligned
Reduce speculative steps: --sglang-speculative-num-steps 2
Use CPU backup: --sglang-enable-draft-weights-cpu-backup
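
A rough model shows why a falling acceptance rate matters. If each drafted token is accepted independently with probability a, and k tokens are drafted per step, the expected number of tokens emitted per target verification is (1 - a^(k+1)) / (1 - a). Real acceptance is not independent across positions, so treat this as an approximation; the function name is hypothetical.

```python
def expected_tokens_per_step(acceptance_rate, num_draft_tokens):
    """Expected tokens emitted per verify step under i.i.d. acceptance."""
    a, k = acceptance_rate, num_draft_tokens
    if a >= 1.0:
        return float(k + 1)
    return (1.0 - a ** (k + 1)) / (1.0 - a)

for rate in (0.9, 0.7, 0.5):
    print(rate, round(expected_tokens_per_step(rate, 3), 3))
```

As the rate drops, most drafted tokens are discarded, so drafting fewer of them wastes less compute; that is the intuition behind lowering the number of speculative steps.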

Issue: Train-Inference Mismatch

Symptoms: Policy divergence, reward collapse

Solutions:

Use TIS for off-policy correction: --use-tis --tis-threshold 0.9
Verify log probs match between SGLang and Megatron
Enable R3 for MoE models
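
The log-prob check in the second item can be sketched as follows, assuming both sides expose per-token log-probabilities for the same sampled tokens (for example `rollout_log_probs` on a Sample and a recomputed list from the training side). The helper name is hypothetical.

```python
def log_prob_mismatch(rollout_log_probs, train_log_probs):
    """Return (max absolute difference, mean of rollout minus train log-probs)."""
    if not rollout_log_probs or len(rollout_log_probs) != len(train_log_probs):
        raise ValueError("log-prob sequences must be non-empty and equal in length")
    diffs = [r - t for r, t in zip(rollout_log_probs, train_log_probs)]
    max_abs = max(abs(d) for d in diffs)
    mean_diff = sum(diffs) / len(diffs)
    return max_abs, mean_diff

aligned = log_prob_mismatch([-1.0, -2.0], [-1.0, -2.0])
drifted = log_prob_mismatch([-1.0, -2.0], [-1.0, -2.5])
print(aligned)  # (0.0, 0.0)
print(drifted)  # (0.5, 0.25)
```

Because the tokens were sampled from the rollout policy, the mean difference is a Monte Carlo estimate of the KL divergence from the rollout policy to the training policy. It is exactly zero only when every per-token log-prob matches.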

Supported Models

Family	Models	MoE Support
DeepSeek	R1, V3, V3.2	Full
Qwen	2, 2.5, 3 (including MoE)	Full
Llama	3, 3.1, 3.3, 4	Dense only
Gemma	2, 3, 3N	Dense only
GLM	4.5, 4.6, 4.7	Dense only
MiniMax	M2, M2.1	Full

Resources

GitHub: https://github.com/radixark/miles
Introduction Blog: https://lmsys.org/blog/2025-11-19-miles/
Slime (upstream): https://github.com/THUDM/slime
SGLang: https://github.com/sgl-project/sglang

Repository: orchestra-resea…h-skills
GitHub Stars: 5.5K
First Seen: Feb 7, 2026
Security Audits: Gen Agent Trust Hub (Fail), Socket (Pass), Snyk (Pass)
Installed on: opencode (55), codex (54), cursor (54), gemini-cli (53), claude-code (53), github-copilot (52)
