slime-rl-training by orchestra-research/ai-research-skills

npx skills add https://github.com/orchestra-research/ai-research-skills --skill slime-rl-training
slime is an LLM post-training framework from Tsinghua's THUDM team, powering GLM-4.5, GLM-4.6, and GLM-4.7. It connects Megatron-LM for training with SGLang for high-throughput rollout generation.
Choose slime when you need:
Consider alternatives when:
┌─────────────────────────────────────────────────────────┐
│ Data Buffer │
│ - Prompt initialization and management │
│ - Custom data generation and filtering │
│ - Rollout sample storage │
└─────────────┬───────────────────────────┬───────────────┘
│ │
┌─────────────▼───────────┐ ┌─────────────▼───────────────┐
│ Training (Megatron-LM) │ │ Rollout (SGLang + Router) │
│ - Actor model training │ │ - Response generation │
│ - Critic (optional) │ │ - Reward/verifier output │
│ - Weight sync to rollout│ │ - Multi-turn support │
└─────────────────────────┘ └─────────────────────────────┘
# Recommended: Docker
docker pull slimerl/slime:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
-it slimerl/slime:latest /bin/bash
# Inside container
cd /root/slime && pip install -e . --no-deps
git clone https://github.com/THUDM/slime.git
cd slime
pip install -r requirements.txt
pip install -e .
# Source model configuration
source scripts/models/qwen3-4B.sh
# Launch training
python train.py \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 4 \
--rollout-num-gpus 4 \
--advantage-estimator grpo \
--use-kl-loss --kl-loss-coef 0.001 \
--rollout-batch-size 32 \
--n-samples-per-prompt 8 \
--global-batch-size 256 \
--num-rollout 3000 \
--prompt-data /path/to/data.jsonl \
${MODEL_ARGS[@]} ${CKPT_ARGS[@]}
Use this workflow for training reasoning models with group-relative advantages.
# data.jsonl format
{"prompt": "What is 2 + 2?", "label": "4"}
{"prompt": "Solve: 3x = 12", "label": "x = 4"}
Or with chat format:
{
"prompt": [
{"role": "system", "content": "You are a math tutor."},
{"role": "user", "content": "What is 15 + 27?"}
],
"label": "42"
}
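Before launching, it can help to sanity-check the dataset. Below is a minimal, illustrative validator for both accepted formats (slime does its own parsing via `--input-key`/`--label-key`; this is only a pre-flight check):

```python
import json

def validate_line(line: str) -> bool:
    """Return True if a data.jsonl line matches either accepted format."""
    record = json.loads(line)
    if "prompt" not in record or "label" not in record:
        return False
    prompt = record["prompt"]
    if isinstance(prompt, str):        # plain format: prompt is a string
        return True
    if isinstance(prompt, list):       # chat format: list of role/content dicts
        return all(
            isinstance(msg, dict) and "role" in msg and "content" in msg
            for msg in prompt
        )
    return False

print(validate_line('{"prompt": "What is 2 + 2?", "label": "4"}'))  # True
```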
Choose a pre-configured model script:
# List available models
ls scripts/models/
# glm4-9B.sh, qwen3-4B.sh, qwen3-30B-A3B.sh, deepseek-v3.sh, llama3-8B.sh, ...
# Source your model
source scripts/models/qwen3-4B.sh
python train.py \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 8 \
--rollout-num-gpus 8 \
--advantage-estimator grpo \
--use-kl-loss \
--kl-loss-coef 0.001 \
--prompt-data /path/to/train.jsonl \
--input-key prompt \
--label-key label \
--apply-chat-template \
--rollout-batch-size 32 \
--n-samples-per-prompt 8 \
--global-batch-size 256 \
--num-rollout 3000 \
--save-interval 100 \
--eval-interval 50 \
${MODEL_ARGS[@]}
tensorboard --logdir outputs/

Use async mode for higher throughput by overlapping rollout and training.
python train_async.py \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 8 \
--rollout-num-gpus 8 \
--advantage-estimator grpo \
--async-buffer-size 4 \
--prompt-data /path/to/train.jsonl \
${MODEL_ARGS[@]}
--async-buffer-size 4 # Number of rollouts to buffer
--update-weights-interval 2 # Sync weights every N rollouts
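The two knobs above can be pictured as a bounded producer/consumer queue. The following asyncio sketch is purely illustrative (not slime's actual implementation): generation fills the buffer while training drains it, and a full buffer blocks further generation.

```python
import asyncio

async def rollout_worker(queue: asyncio.Queue, num_rollouts: int) -> None:
    """Produce rollouts; put() blocks whenever the buffer is full."""
    for i in range(num_rollouts):
        await asyncio.sleep(0)          # stand-in for generation latency
        await queue.put(f"rollout-{i}")
    await queue.put(None)               # sentinel: generation finished

async def trainer(queue: asyncio.Queue) -> list:
    """Consume buffered rollouts as they become available."""
    trained = []
    while (batch := await queue.get()) is not None:
        trained.append(batch)           # stand-in for one training pass
    return trained

async def main() -> list:
    # The bounded queue plays the role of --async-buffer-size 4:
    # generation may run at most 4 rollouts ahead of training.
    queue = asyncio.Queue(maxsize=4)
    _, trained = await asyncio.gather(rollout_worker(queue, 8), trainer(queue))
    return trained

print(len(asyncio.run(main())))  # 8
```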
Use this workflow for training agents with tool use or multi-step reasoning.
# custom_generate.py
async def custom_generate(args, samples, evaluation=False):
    """Multi-turn generation with tool calling."""
    for sample in samples:
        conversation = sample.prompt
        for turn in range(args.max_turns):
            # Generate response
            response = await generate_single(conversation)
            # Check for tool call
            tool_call = extract_tool_call(response)
            if tool_call:
                tool_result = execute_tool(tool_call)
                conversation.append({"role": "assistant", "content": response})
                conversation.append({"role": "tool", "content": tool_result})
            else:
                break
        sample.response = response
        sample.reward = compute_reward(sample)
    return samples
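The workflow above leaves `extract_tool_call` and `execute_tool` to you. A minimal sketch of `extract_tool_call`, assuming a hypothetical `<tool_call>{...}</tool_call>` output convention (check your model's actual tool-call format before using this):

```python
import json
import re

# Hypothetical convention: the model emits <tool_call>{...json...}</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def extract_tool_call(response: str):
    """Return the parsed tool-call dict, or None if the response has none."""
    match = TOOL_CALL_RE.search(response)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # malformed call: treat as no call

call = extract_tool_call('<tool_call>{"name": "search", "args": {"q": "slime"}}</tool_call>')
print(call["name"])  # search
```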
python train.py \
--custom-generate-function-path custom_generate.py \
--max-turns 5 \
--prompt-data /path/to/agent_data.jsonl \
${MODEL_ARGS[@]}
See examples/search-r1/ for a complete multi-turn search example.
slime uses three types of arguments:
1. Megatron Arguments (passed directly):
--tensor-model-parallel-size 2
--pipeline-model-parallel-size 1
--num-layers 32
--hidden-size 4096
2. SGLang Arguments (prefixed with --sglang-):
--sglang-mem-fraction-static 0.8
--sglang-context-length 8192
--sglang-log-level INFO
3. slime Arguments:
# Resource allocation
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--colocate # Share GPUs between training/inference
# Data
--prompt-data /path/to/data.jsonl
--input-key prompt
--label-key label
# Training loop
--num-rollout 3000
--rollout-batch-size 32
--n-samples-per-prompt 8
--global-batch-size 256
# Algorithm
--advantage-estimator grpo # or: gspo, ppo, reinforce_plus_plus
--use-kl-loss
--kl-loss-coef 0.001
rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout
Example: 32 × 8 = 256 × 1
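To apply the identity concretely, a small helper (hypothetical, not part of slime) can derive `num_steps_per_rollout` from the other three values:

```python
def num_steps_per_rollout(rollout_batch_size: int,
                          n_samples_per_prompt: int,
                          global_batch_size: int) -> int:
    """Optimizer steps per rollout, from the batch-size identity above."""
    total_samples = rollout_batch_size * n_samples_per_prompt
    assert total_samples % global_batch_size == 0, \
        "rollout samples must divide evenly into global batches"
    return total_samples // global_batch_size

print(num_steps_per_rollout(32, 8, 256))  # 1
```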
slime's data buffer enables flexible data management:
class RolloutDataSource:
    def get_samples(self, num_samples):
        """Fetch prompts from dataset."""
        return self.dataset.sample(num_samples)

    def add_samples(self, samples):
        """Called after generation (no-op by default)."""
        pass

class RolloutDataSourceWithBuffer(RolloutDataSource):
    def __init__(self):
        self.buffer = []

    def add_samples(self, samples):
        """Store generated samples for reuse."""
        self.buffer.extend(samples)

    def buffer_filter(self, args, buffer, num_samples):
        """Custom selection logic (prioritized, stratified, etc.)."""
        return select_best(buffer, num_samples)
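`select_best` above is a placeholder. One hypothetical `buffer_filter`, shown here with plain dicts for self-containment, pops the highest-reward samples first:

```python
def reward_priority_filter(args, buffer, num_samples):
    """Pop the num_samples highest-reward samples; leave the rest buffered."""
    buffer.sort(key=lambda s: s["reward"], reverse=True)
    # Tuple assignment: select the head, shrink the buffer in place.
    selected, buffer[:] = buffer[:num_samples], buffer[num_samples:]
    return selected

buf = [{"id": i, "reward": r} for i, r in enumerate([0.1, 0.9, 0.5])]
picked = reward_priority_filter(None, buf, 2)
print([s["reward"] for s in picked])  # [0.9, 0.5]
```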
Symptoms: Inference engine dies mid-training
Solutions:
# Enable fault tolerance
--use-fault-tolerance
# Increase memory allocation
--sglang-mem-fraction-static 0.85
# Reduce batch size
--rollout-batch-size 16
Symptoms: Training hangs after rollout
Solutions:
# Increase sync interval
--update-weights-interval 5
# Use colocated mode (no network transfer)
--colocate
Symptoms: CUDA OOM in backward pass
Solutions:
# Enable gradient checkpointing
--recompute-activations
# Reduce micro-batch size
--micro-batch-size 1
# Enable sequence parallelism
--sequence-parallel
Symptoms: GPU idle during data fetch
Solutions:
# Increase data workers
--num-data-workers 4
# Use streaming dataset
--streaming-data
| Model Family | Configurations |
|---|---|
| GLM | GLM-4.5, GLM-4.6, GLM-4.7, GLM-Z1-9B |
| Qwen | Qwen3 (4B, 8B, 30B-A3B), Qwen3-MoE, Qwen2.5 |
| DeepSeek | V3, V3.1, R1 |
| Llama | Llama 3 (8B, 70B) |
| Others | Kimi K2, Moonlight-16B |
Each model has pre-configured scripts in scripts/models/.
Share GPUs between training and inference to reduce memory:
python train.py \
--colocate \
--actor-num-gpus-per-node 8 \
--sglang-mem-fraction-static 0.4 \
${MODEL_ARGS[@]}
# custom_rm.py
class CustomRewardModel:
    def __init__(self, model_path):
        self.model = load_model(model_path)

    def compute_reward(self, prompts, responses):
        inputs = self.tokenize(prompts, responses)
        scores = self.model(inputs)
        return scores.tolist()
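For tasks with verifiable answers, a rule-based scorer can stand in for a learned reward model. A hypothetical exact-match reward (the number-extraction regex and class name are assumptions for illustration, not slime's API):

```python
import re

class ExactMatchReward:
    """Score 1.0 when the response's final number matches the label, else 0.0."""

    def compute_reward(self, responses, labels):
        rewards = []
        for response, label in zip(responses, labels):
            # Take the last number-like token as the model's final answer.
            numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
            final = numbers[-1] if numbers else None
            rewards.append(1.0 if final == label else 0.0)
        return rewards

rm = ExactMatchReward()
print(rm.compute_reward(["2 + 2 = 4", "the answer is 5"], ["4", "7"]))  # [1.0, 0.0]
```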
--custom-rm-path custom_rm.py
--eval-prompt-data aime /path/to/aime.jsonl \
--eval-prompt-data gsm8k /path/to/gsm8k.jsonl \
--n-samples-per-eval-prompt 16
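With `--n-samples-per-eval-prompt 16`, a natural per-prompt metric is avg@16: the fraction of the 16 sampled responses that are correct. A tiny sketch (illustrative only; slime's built-in eval reporting may differ):

```python
def avg_at_k(correct_flags):
    """Mean accuracy over k sampled responses for one eval prompt."""
    return sum(correct_flags) / len(correct_flags)

print(avg_at_k([True] * 12 + [False] * 4))  # 0.75
```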
See the examples/ directory for 14+ worked examples.

Weekly Installs: 69
GitHub Stars: 5.7K
First Seen: Feb 7, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: opencode (60), codex (59), cursor (59), gemini-cli (58), claude-code (57), github-copilot (57)