openclaw-rl-training by aradotso/trending-skills
npx skills add https://github.com/aradotso/trending-skills --skill openclaw-rl-training
Skill by ara.so — Daily 2026 Skills collection.
OpenClaw-RL is a fully asynchronous reinforcement learning framework that converts live multi-turn conversations into training signals for personalized AI agents. It wraps a self-hosted model as an OpenAI-compatible API via OpenClaw, intercepts conversations, and continuously optimizes the policy in the background without interrupting usage. It also supports scalable RL for terminal, GUI, SWE, and tool-call agents.
Four independent async loops that never block each other:
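The four loops are not enumerated on this page. As a rough, purely illustrative sketch of the decoupled pattern (the loop names, stub functions, and batch sizes below are assumptions, not OpenClaw-RL internals), serving, judging, training, and weight-sync loops can each run independently and hand work off through queues:
import asyncio
import random
captured: asyncio.Queue = asyncio.Queue()  # turns intercepted by the API server
scored: asyncio.Queue = asyncio.Queue()    # turns with rewards attached
async def intercept_turn() -> dict:
    await asyncio.sleep(0.1)               # stands in for a live conversation turn
    return {"prompt": "...", "response": "..."}
async def judge_turn(turn: dict) -> float:
    await asyncio.sleep(0.05)              # stands in for a PRM / judge call
    return random.random()
async def serving_loop():
    while True:
        await captured.put(await intercept_turn())   # never blocks on training
async def judging_loop():
    while True:
        turn = await captured.get()
        turn["reward"] = await judge_turn(turn)
        await scored.put(turn)
async def training_loop():
    while True:
        batch = [await scored.get() for _ in range(4)]
        await asyncio.sleep(0.2)            # stands in for one optimizer step
async def weight_sync_loop():
    while True:
        await asyncio.sleep(1.0)            # periodically push fresh weights to the serving backend
async def main():
    await asyncio.gather(serving_loop(), judging_loop(), training_loop(), weight_sync_loop())
asyncio.run(main())  # runs until interrupted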
git clone https://github.com/Gen-Verse/OpenClaw-RL
cd OpenClaw-RL
# Install core dependencies
pip install -r requirements.txt
# Install slime (training backend)
cd slime && pip install -e . && cd ..
# Optional: install SGLang for fast inference
pip install sglang
OpenClaw-RL/
├── openclaw-rl/ # Binary RL (GRPO) method
├── openclaw-opd/ # On-Policy Distillation method
├── openclaw-combine/ # Combined Binary RL + OPD
├── openclaw-test/ # Evaluation utilities
├── terminal-rl/ # Track 2: Terminal agent RL
├── gui-rl/ # Track 2: GUI agent RL
├── swe-rl/ # Track 2: SWE agent RL
├── toolcall-rl/ # Track 2: Tool-call agent RL
├── slime/ # Core training framework
└── openclaw/ # Runtime / API server
A Process Reward Model scores each turn from next-state feedback, using GRPO advantage estimation with a PPO-style clipped surrogate loss.
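As a minimal sketch of the group-normalized advantage step, assuming the standard GRPO formulation (the actual implementation in slime may differ):
import torch
def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: shape (num_rollouts_per_prompt,), one PRM score per rollout of the same prompt."""
    mean = rewards.mean()
    std = rewards.std(unbiased=False).clamp_min(1e-6)  # guard against zero variance
    return (rewards - mean) / std
# Example: four rollouts of one prompt scored by the PRM in [0, 1]
advantages = grpo_advantages(torch.tensor([0.9, 0.2, 0.7, 0.1]))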
When the next state reveals useful hindsight, a judge extracts a textual hint to augment the prompt, creating an enhanced teacher. The token-level log-probability gap between teacher and student becomes a directional advantage signal.
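A rough sketch of that directional signal, assuming per-token log-probabilities of the same response are available under the original prompt (student view) and the hint-augmented prompt (teacher view); the function and tensor names are illustrative:
import torch
def opd_token_advantages(student_logprobs: torch.Tensor, teacher_logprobs: torch.Tensor) -> torch.Tensor:
    """Both tensors: shape (seq_len,), log-probability of each generated token."""
    # Positive where the hint-augmented teacher prefers the token more strongly,
    # negative where it prefers it less: a per-token directional advantage.
    return teacher_logprobs - student_logprobs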
Merges Binary RL's scalar supervision with OPD's token-level directional signal. The strongest and most robust optimization method.
# openclaw-rl/run_qwen3_7b_openclaw_rl.sh
export MODEL_PATH=/path/to/qwen3-7b
export DATA_PATH=/path/to/conversation/data
export CKPT_SAVE_DIR=/path/to/checkpoints
bash openclaw-rl/run_qwen3_7b_openclaw_rl.sh
export MODEL_PATH=/path/to/qwen3-7b
export JUDGE_MODEL_PATH=/path/to/judge-model
export DATA_PATH=/path/to/conversation/data
bash openclaw-opd/run_qwen3_7b_openclaw_opd.sh
# Launch with combined Binary RL + OPD
bash openclaw-combine/run_qwen3_7b_openclaw_combine.sh
# Model configuration
export MODEL_PATH=/path/to/base/model
export JUDGE_MODEL_PATH=/path/to/judge/model # For OPD
export PRM_MODEL_PATH=/path/to/prm/model # For Binary RL
# Training configuration
export CKPT_SAVE_DIR=./checkpoints
export CKPT_ARGS="--save-interval 100 --save-dir $CKPT_SAVE_DIR"
# Rollout configuration
export ROLLOUT_ARGS="--rollout-batch-size 64 --num-rollouts-per-prompt 4"
# Optimizer configuration
export OPTIMIZER_ARGS="--lr 1e-6 --weight-decay 0.01 --adam-beta1 0.9 --adam-beta2 0.999"
# GPU partitioning (e.g., 8 GPUs: 4 for training, 4 for rollout)
export TRAIN_GPUS="0,1,2,3"
export ROLLOUT_GPUS="4,5,6,7"
# LoRA (optional, reduces GPU memory)
export LORA_ARGS="--lora-rank 64 --lora-alpha 128 --lora-dropout 0.05"
# Add LoRA args to any launch script
export LORA_ARGS="--use-lora --lora-rank 64 --lora-alpha 128"
# Example: LoRA Binary RL
bash openclaw-rl/run_qwen3_7b_lora_openclaw_rl.sh
The slime framework exposes extension points without modifying core code:
# Custom loss function
--custom-loss-function-path ./my_method/custom_loss.py
# Custom rollout function
--rollout-function-path ./my_method/custom_rollout.py
# Custom generation function
--custom-generate-function-path ./my_method/custom_generate.py
# Custom reward model
--custom-rm-path ./my_method/custom_rm.py
# my_method/custom_loss.py
import torch
from typing import Dict, Any
def compute_loss(
    policy_logits: torch.Tensor,
    reference_logits: torch.Tensor,
    rewards: torch.Tensor,
    advantages: torch.Tensor,
    config: Dict[str, Any]
) -> torch.Tensor:
    """
    Custom GRPO-style loss with clipped surrogate objective.
    """
    # Log-ratio between policy and reference (assumed to be per-token log-probabilities of the sampled tokens)
    log_ratio = policy_logits - reference_logits
    ratio = torch.exp(log_ratio)
    clip_range = config.get("clip_range", 0.2)
    # PPO-style clipped objective
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range)
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    # KL penalty
    kl_coeff = config.get("kl_coeff", 0.01)
    kl_penalty = kl_coeff * log_ratio.mean()
    return loss + kl_penalty
# my_method/custom_rm.py
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
class CustomPRM:
    def __init__(self, model_path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_path, torch_dtype=torch.bfloat16
        )
        self.model.eval()
    def score(self, prompt: str, response: str, next_state: str) -> float:
        """
        Score a turn given prompt, response, and next-state feedback.
        """
        combined = f"Prompt: {prompt}\nResponse: {response}\nOutcome: {next_state}"
        inputs = self.tokenizer(combined, return_tensors="pt", truncation=True, max_length=2048)
        with torch.no_grad():
            logits = self.model(**inputs).logits
        # Binary reward: positive class probability
        return torch.softmax(logits, dim=-1)[0, 1].item()
def get_reward_model(config):
    return CustomPRM(config["prm_model_path"])
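The --rollout-function-path hook can be filled in the same way. The exact signature slime expects is not documented on this page, so the entry-point name, arguments, and return format below are assumptions, shown only to illustrate the shape of such a hook:
# my_method/custom_rollout.py (hypothetical signature; check slime's docs for the real hook)
from typing import Any, Callable, Dict, List
def generate_rollouts(
    prompts: List[str],
    generate_fn: Callable[[str], str],   # injected sampler, e.g. the SGLang backend
    config: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Sample num_rollouts_per_prompt responses per prompt for downstream scoring."""
    n = config.get("num_rollouts_per_prompt", 4)
    rollouts = []
    for prompt in prompts:
        for _ in range(n):
            rollouts.append({"prompt": prompt, "response": generate_fn(prompt)})
    return rollouts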
# One-line cloud deployment — Hybrid RL, OPD, Binary RL all supported
export TINKER_API_KEY=$TINKER_API_KEY
export TINKER_ENDPOINT=$TINKER_ENDPOINT
# Submit job via Ray
ray job submit --address $TINKER_ENDPOINT \
--working-dir . \
-- bash openclaw-combine/run_qwen3_7b_openclaw_combine.sh
export ENV_TYPE=terminal
export MAX_STEPS=20
export PARALLEL_ENVS=32 # Number of parallel environment instances
bash terminal-rl/run_terminal_rl.sh
export ENV_TYPE=gui
export SCREENSHOT_BACKEND=playwright # or selenium
export PARALLEL_ENVS=16
bash gui-rl/run_gui_rl.sh
export ENV_TYPE=toolcall
export TOOLS_CONFIG=./toolcall-rl/tools_config.json
export PARALLEL_ENVS=64
bash toolcall-rl/run_toolcall_rl.sh
export ENV_TYPE=swe
export SWE_BENCH_PATH=/path/to/swe-bench
export PARALLEL_ENVS=8 # SWE environments are heavier
bash swe-rl/run_swe_rl.sh
OpenClaw-RL automatically classifies API messages. Manual format for custom data:
{
  "session_id": "user_session_abc123",
  "turns": [
    {
      "type": "main",
      "prompt": "Help me refactor this function to use async/await",
      "response": "Here's the refactored version: ...",
      "next_state": "User accepted the change and said 'perfect, thanks!'",
      "trainable": true
    },
    {
      "type": "side",
      "prompt": "What is 2+2?",
      "response": "4",
      "trainable": false
    }
  ]
}
main turns: Multi-turn interactions that form training trajectories
side turns: Non-trainable system/utility turns excluded from training
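Only main turns marked trainable feed the training loop. A small sketch of loading and filtering such records, assuming captured sessions are stored as JSON files in the RL buffer directory (the path and helper name are illustrative):
import json
from pathlib import Path
def load_trainable_turns(buffer_dir: str = "./rl_buffer") -> list:
    """Collect main turns marked trainable from captured session files."""
    turns = []
    for path in Path(buffer_dir).glob("*.json"):
        session = json.loads(path.read_text())
        turns += [t for t in session["turns"] if t["type"] == "main" and t.get("trainable")]
    return turns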
# Start OpenClaw-compatible API server wrapping your model
export BASE_MODEL_PATH=/path/to/your/model
export OPENCLAW_PORT=8000
export OPENCLAW_HOST=0.0.0.0
# Using SGLang backend (recommended for speed)
python -m openclaw.server \
  --model-path $BASE_MODEL_PATH \
  --port $OPENCLAW_PORT \
  --backend sglang \
  --enable-rl-intercept \
  --rl-buffer-dir ./rl_buffer
# --enable-rl-intercept enables conversation capture for RL; --rl-buffer-dir is where captured trajectories are stored
// Using the server as an OpenAI-compatible API in TypeScript
import OpenAI from "openai";
const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  apiKey: process.env.OPENCLAW_API_KEY ?? "local",
});
const response = await client.chat.completions.create({
  model: "your-model-name",
  messages: [
    { role: "user", content: "Help me write a sorting algorithm" }
  ],
  stream: true,
});
for await (const chunk of response) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
# Enable majority voting for more robust reward estimation
export MAJORITY_VOTE_N=5 # Number of judge calls per turn
export MAJORITY_VOTE_THRESHOLD=0.6
# Add to your launch script args:
--majority-vote-n $MAJORITY_VOTE_N \
--majority-vote-threshold $MAJORITY_VOTE_THRESHOLD
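Conceptually, the judge is called N times per turn and the votes are aggregated against the threshold. A minimal sketch under that assumption (the exact aggregation rule used internally may differ):
def majority_vote_reward(judge_scores, threshold: float = 0.6) -> float:
    """Binary reward: 1.0 if at least `threshold` fraction of judge calls vote positive."""
    votes = [score >= 0.5 for score in judge_scores]  # each judge call votes pass/fail
    return 1.0 if sum(votes) / len(votes) >= threshold else 0.0
# Example: 5 judge calls, 3 positive -> 3/5 = 0.6 >= 0.6 -> reward 1.0
reward = majority_vote_reward([0.8, 0.7, 0.4, 0.9, 0.3])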
# 1. Create a new top-level folder
mkdir my-new-method
cd my-new-method
# 2. Required files
touch README.md # Document what, how, env vars
touch run_qwen3_7b_my_method.sh # Launch script
touch custom_loss.py # If custom loss needed
touch custom_rollout.py # If custom rollout needed
# run_qwen3_7b_my_method.sh — follow existing conventions
#!/bin/bash
set -e
MODEL_SIZE="7b"
MODEL_PATH=${MODEL_PATH:-/path/to/qwen3-7b}
CKPT_SAVE_DIR=${CKPT_SAVE_DIR:-./checkpoints/my-method}
CKPT_ARGS="--save-interval 50 --save-dir $CKPT_SAVE_DIR"
ROLLOUT_ARGS="--rollout-batch-size 32 --num-rollouts-per-prompt 4"
OPTIMIZER_ARGS="--lr 1e-6 --weight-decay 0.01"
ray job submit --working-dir .. -- \
python slime/train.py \
--model-path $MODEL_PATH \
--custom-loss-function-path my-new-method/custom_loss.py \
$CKPT_ARGS $ROLLOUT_ARGS $OPTIMIZER_ARGS
# View the Ray dashboard (served at http://localhost:8265 while the Ray cluster is running)
# Watch checkpoint saves
watch -n 10 ls -la $CKPT_SAVE_DIR
# Stream training logs
tail -f ./logs/training.log
export RESUME_CKPT=$CKPT_SAVE_DIR/checkpoint-500
# Add to launch script:
--resume-from-checkpoint $RESUME_CKPT
bash openclaw-test/run_eval.sh \
--model-path $CKPT_SAVE_DIR/checkpoint-latest \
--eval-tasks "conversation,coding,tool-use"
Out of GPU memory during rollout + training:
# Use LoRA to reduce memory footprint
export LORA_ARGS="--use-lora --lora-rank 32"
# Or reduce parallel environments
export PARALLEL_ENVS=8
# Or use offloading
--offload-optimizer-state
Async loop falling behind (buffer overflow):
# Reduce rollout batch size or increase judge throughput
export ROLLOUT_ARGS="--rollout-batch-size 16"
# Or add more judge workers
--num-judge-workers 4
PRM scores all near 0.5 (reward collapse):
# Ensure next_state fields contain meaningful feedback signals
# Increase the number of judge calls per turn
--majority-vote-n 7
SGLang server not starting:
# Check SGLang version compatibility
pip install sglang==0.4.x # Check slime/requirements.txt for pinned version
# Fallback to vLLM backend
--backend vllm
Ray job submission fails:
# Start Ray cluster first
ray start --head --num-gpus=$(nvidia-smi -L | wc -l)
# Then submit job
ray job submit --address auto -- bash run.sh
Weekly Installs: 247
Repository
GitHub Stars: 10
First Seen: 6 days ago
Security Audits: Gen Agent Trust Hub: Warn · Socket: Pass · Snyk: Warn
Installed on: github-copilot (246), codex (246), amp (246), cline (246), kimi-cli (246), gemini-cli (246)