slime-rl-training by orchestra-research/ai-research-skills

npx skills add https://github.com/orchestra-research/ai-research-skills --skill slime-rl-training
slime is an LLM post-training framework from Tsinghua's THUDM team, powering GLM-4.5, GLM-4.6, and GLM-4.7. It connects Megatron-LM for training with SGLang for high-throughput rollout generation.
Choose slime when you need:
Consider alternatives when:
┌─────────────────────────────────────────────────────────┐
│ Data Buffer │
│ - Prompt initialization and management │
│ - Custom data generation and filtering │
│ - Rollout sample storage │
└─────────────┬───────────────────────────┬───────────────┘
│ │
┌─────────────▼───────────┐ ┌─────────────▼───────────────┐
│ Training (Megatron-LM) │ │ Rollout (SGLang + Router) │
│ - Actor model training │ │ - Response generation │
│ - Critic (optional) │ │ - Reward/verifier output │
│ - Weight sync to rollout│ │ - Multi-turn support │
└─────────────────────────┘ └─────────────────────────────┘
# Recommended: Docker
docker pull slimerl/slime:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
-it slimerl/slime:latest /bin/bash
# Inside container
cd /root/slime && pip install -e . --no-deps
git clone https://github.com/THUDM/slime.git
cd slime
pip install -r requirements.txt
pip install -e .
# Source model configuration
source scripts/models/qwen3-4B.sh
# Launch training
python train.py \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 4 \
--rollout-num-gpus 4 \
--advantage-estimator grpo \
--use-kl-loss --kl-loss-coef 0.001 \
--rollout-batch-size 32 \
--n-samples-per-prompt 8 \
--global-batch-size 256 \
--num-rollout 3000 \
--prompt-data /path/to/data.jsonl \
${MODEL_ARGS[@]} ${CKPT_ARGS[@]}
Use this workflow for training reasoning models with group-relative advantages.
# data.jsonl format
{"prompt": "What is 2 + 2?", "label": "4"}
{"prompt": "Solve: 3x = 12", "label": "x = 4"}
Or with chat format:
{
"prompt": [
{"role": "system", "content": "You are a math tutor."},
{"role": "user", "content": "What is 15 + 27?"}
],
"label": "42"
}
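Before launching, it can help to sanity-check the dataset. Below is a minimal, illustrative validator for both accepted formats (slime does its own parsing via `--input-key`/`--label-key`; this is only a pre-flight check):

```python
import json

def validate_line(line: str) -> bool:
    """Return True if a data.jsonl line matches either accepted format."""
    record = json.loads(line)
    if "prompt" not in record or "label" not in record:
        return False
    prompt = record["prompt"]
    if isinstance(prompt, str):        # plain format: prompt is a string
        return True
    if isinstance(prompt, list):       # chat format: list of role/content dicts
        return all(
            isinstance(msg, dict) and "role" in msg and "content" in msg
            for msg in prompt
        )
    return False

print(validate_line('{"prompt": "What is 2 + 2?", "label": "4"}'))  # True
```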
Choose a pre-configured model script:
# List available models
ls scripts/models/
# glm4-9B.sh, qwen3-4B.sh, qwen3-30B-A3B.sh, deepseek-v3.sh, llama3-8B.sh, ...
# Source your model
source scripts/models/qwen3-4B.sh
python train.py \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 8 \
--rollout-num-gpus 8 \
--advantage-estimator grpo \
--use-kl-loss \
--kl-loss-coef 0.001 \
--prompt-data /path/to/train.jsonl \
--input-key prompt \
--label-key label \
--apply-chat-template \
--rollout-batch-size 32 \
--n-samples-per-prompt 8 \
--global-batch-size 256 \
--num-rollout 3000 \
--save-interval 100 \
--eval-interval 50 \
${MODEL_ARGS[@]}
tensorboard --logdir outputs/

Use async mode for higher throughput by overlapping rollout and training.
python train_async.py \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 8 \
--rollout-num-gpus 8 \
--advantage-estimator grpo \
--async-buffer-size 4 \
--prompt-data /path/to/train.jsonl \
${MODEL_ARGS[@]}
--async-buffer-size 4 # Number of rollouts to buffer
--update-weights-interval 2 # Sync weights every N rollouts
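The two knobs above can be pictured as a bounded producer/consumer queue. The following asyncio sketch is purely illustrative (not slime's actual implementation): generation fills the buffer while training drains it, and a full buffer blocks further generation.

```python
import asyncio

async def rollout_worker(queue: asyncio.Queue, num_rollouts: int) -> None:
    """Produce rollouts; put() blocks whenever the buffer is full."""
    for i in range(num_rollouts):
        await asyncio.sleep(0)          # stand-in for generation latency
        await queue.put(f"rollout-{i}")
    await queue.put(None)               # sentinel: generation finished

async def trainer(queue: asyncio.Queue) -> list:
    """Consume buffered rollouts as they become available."""
    trained = []
    while (batch := await queue.get()) is not None:
        trained.append(batch)           # stand-in for one training pass
    return trained

async def main() -> list:
    # The bounded queue plays the role of --async-buffer-size 4:
    # generation may run at most 4 rollouts ahead of training.
    queue = asyncio.Queue(maxsize=4)
    _, trained = await asyncio.gather(rollout_worker(queue, 8), trainer(queue))
    return trained

print(len(asyncio.run(main())))  # 8
```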
Use this workflow for training agents with tool use or multi-step reasoning.
# custom_generate.py
async def custom_generate(args, samples, evaluation=False):
    """Multi-turn generation with tool calling."""
    for sample in samples:
        conversation = sample.prompt
        for turn in range(args.max_turns):
            # Generate response
            response = await generate_single(conversation)
            # Check for tool call
            tool_call = extract_tool_call(response)
            if tool_call:
                tool_result = execute_tool(tool_call)
                conversation.append({"role": "assistant", "content": response})
                conversation.append({"role": "tool", "content": tool_result})
            else:
                break
        sample.response = response
        sample.reward = compute_reward(sample)
    return samples
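The workflow above leaves `extract_tool_call` and `execute_tool` to you. A minimal sketch of `extract_tool_call`, assuming a hypothetical `<tool_call>{...}</tool_call>` output convention (check your model's actual tool-call format before using this):

```python
import json
import re

# Hypothetical convention: the model emits <tool_call>{...json...}</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def extract_tool_call(response: str):
    """Return the parsed tool-call dict, or None if the response has none."""
    match = TOOL_CALL_RE.search(response)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # malformed call: treat as no call

call = extract_tool_call('<tool_call>{"name": "search", "args": {"q": "slime"}}</tool_call>')
print(call["name"])  # search
```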
python train.py \
--custom-generate-function-path custom_generate.py \
--max-turns 5 \
--prompt-data /path/to/agent_data.jsonl \
${MODEL_ARGS[@]}
See examples/search-r1/ for a complete multi-turn search example.
slime uses three types of arguments:
1. Megatron Arguments (passed directly):
--tensor-model-parallel-size 2
--pipeline-model-parallel-size 1
--num-layers 32
--hidden-size 4096
2. SGLang Arguments (prefixed with --sglang-):
--sglang-mem-fraction-static 0.8
--sglang-context-length 8192
--sglang-log-level INFO
3. slime Arguments:
# Resource allocation
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--colocate # Share GPUs between training/inference
# Data
--prompt-data /path/to/data.jsonl
--input-key prompt
--label-key label
# Training loop
--num-rollout 3000
--rollout-batch-size 32
--n-samples-per-prompt 8
--global-batch-size 256
# Algorithm
--advantage-estimator grpo # or: gspo, ppo, reinforce_plus_plus
--use-kl-loss
--kl-loss-coef 0.001
rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout
Example: 32 × 8 = 256 × 1
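To apply the identity concretely, a small helper (hypothetical, not part of slime) can derive `num_steps_per_rollout` from the other three values:

```python
def num_steps_per_rollout(rollout_batch_size: int,
                          n_samples_per_prompt: int,
                          global_batch_size: int) -> int:
    """Optimizer steps per rollout, from the batch-size identity above."""
    total_samples = rollout_batch_size * n_samples_per_prompt
    assert total_samples % global_batch_size == 0, \
        "rollout samples must divide evenly into global batches"
    return total_samples // global_batch_size

print(num_steps_per_rollout(32, 8, 256))  # 1
```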
slime's data buffer enables flexible data management:
class RolloutDataSource:
    def get_samples(self, num_samples):
        """Fetch prompts from dataset."""
        return self.dataset.sample(num_samples)

    def add_samples(self, samples):
        """Called after generation (no-op by default)."""
        pass

class RolloutDataSourceWithBuffer(RolloutDataSource):
    def __init__(self):
        self.buffer = []

    def add_samples(self, samples):
        """Store generated samples for reuse."""
        self.buffer.extend(samples)

    def buffer_filter(self, args, buffer, num_samples):
        """Custom selection logic (prioritized, stratified, etc.)."""
        return select_best(buffer, num_samples)
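`select_best` above is a placeholder. One hypothetical `buffer_filter`, shown here with plain dicts for self-containment, pops the highest-reward samples first:

```python
def reward_priority_filter(args, buffer, num_samples):
    """Pop the num_samples highest-reward samples; leave the rest buffered."""
    buffer.sort(key=lambda s: s["reward"], reverse=True)
    # Tuple assignment: select the head, shrink the buffer in place.
    selected, buffer[:] = buffer[:num_samples], buffer[num_samples:]
    return selected

buf = [{"id": i, "reward": r} for i, r in enumerate([0.1, 0.9, 0.5])]
picked = reward_priority_filter(None, buf, 2)
print([s["reward"] for s in picked])  # [0.9, 0.5]
```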
Symptoms: Inference engine dies mid-training
Solutions:
# Enable fault tolerance
--use-fault-tolerance
# Increase memory allocation
--sglang-mem-fraction-static 0.85
# Reduce batch size
--rollout-batch-size 16
Symptoms: Training hangs after rollout
Solutions:
# Increase sync interval
--update-weights-interval 5
# Use colocated mode (no network transfer)
--colocate
Symptoms: CUDA OOM in backward pass
Solutions:
# Enable gradient checkpointing
--recompute-activations
# Reduce micro-batch size
--micro-batch-size 1
# Enable sequence parallelism
--sequence-parallel
Symptoms: GPU idle during data fetch
Solutions:
# Increase data workers
--num-data-workers 4
# Use streaming dataset
--streaming-data
| Model Family | Configurations |
|---|---|
| GLM | GLM-4.5, GLM-4.6, GLM-4.7, GLM-Z1-9B |
| Qwen | Qwen3 (4B, 8B, 30B-A3B), Qwen3-MoE, Qwen2.5 |
| DeepSeek | V3, V3.1, R1 |
| Llama | Llama 3 (8B, 70B) |
| Others | Kimi K2, Moonlight-16B |
Each model has pre-configured scripts in scripts/models/.
Share GPUs between training and inference to reduce memory:
python train.py \
--colocate \
--actor-num-gpus-per-node 8 \
--sglang-mem-fraction-static 0.4 \
${MODEL_ARGS[@]}
# custom_rm.py
class CustomRewardModel:
    def __init__(self, model_path):
        self.model = load_model(model_path)

    def compute_reward(self, prompts, responses):
        inputs = self.tokenize(prompts, responses)
        scores = self.model(inputs)
        return scores.tolist()
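For tasks with verifiable answers, a rule-based scorer can stand in for a learned reward model. A hypothetical exact-match reward (the number-extraction regex and class name are assumptions for illustration, not slime's API):

```python
import re

class ExactMatchReward:
    """Score 1.0 when the response's final number matches the label, else 0.0."""

    def compute_reward(self, responses, labels):
        rewards = []
        for response, label in zip(responses, labels):
            # Take the last number-like token as the model's final answer.
            numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
            final = numbers[-1] if numbers else None
            rewards.append(1.0 if final == label else 0.0)
        return rewards

rm = ExactMatchReward()
print(rm.compute_reward(["2 + 2 = 4", "the answer is 5"], ["4", "7"]))  # [1.0, 0.0]
```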
--custom-rm-path custom_rm.py
--eval-prompt-data aime /path/to/aime.jsonl \
--eval-prompt-data gsm8k /path/to/gsm8k.jsonl \
--n-samples-per-eval-prompt 16
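With `--n-samples-per-eval-prompt 16`, a natural per-prompt metric is avg@16: the fraction of the 16 sampled responses that are correct. A tiny sketch (illustrative only; slime's built-in eval reporting may differ):

```python
def avg_at_k(correct_flags):
    """Mean accuracy over k sampled responses for one eval prompt."""
    return sum(correct_flags) / len(correct_flags)

print(avg_at_k([True] * 12 + [False] * 4))  # 0.75
```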
See the examples/ directory for 14+ worked examples.

Weekly Installs: 69
GitHub Stars: 5.7K
First Seen: Feb 7, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: opencode (60), codex (59), cursor (59), gemini-cli (58), claude-code (57), github-copilot (57)