stable-baselines3 by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill stable-baselines3
Stable Baselines3 (SB3) is a PyTorch-based library providing reliable implementations of reinforcement learning algorithms. This skill provides comprehensive guidance for training RL agents, creating custom environments, implementing callbacks, and optimizing training workflows using SB3's unified API.
Basic Training Pattern:
import gymnasium as gym
from stable_baselines3 import PPO
# Create environment
env = gym.make("CartPole-v1")
# Initialize agent
model = PPO("MlpPolicy", env, verbose=1)
# Train the agent
model.learn(total_timesteps=10000)
# Save the model
model.save("ppo_cartpole")
# Load the model (without prior instantiation)
model = PPO.load("ppo_cartpole", env=env)
Important Notes:
- total_timesteps is a lower bound; actual training may exceed it due to batch collection
- Call model.load() as a static method (e.g., PPO.load(...)), not on an existing instance (see the resume sketch below)

Algorithm Selection: Use references/algorithms.md for detailed algorithm characteristics, selection guidance, and a quick-reference comparison.
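To resume training on a loaded model, a minimal sketch (reset_num_timesteps=False preserves the timestep counter so logging continues where it left off):

model = PPO.load("ppo_cartpole", env=env)
# Continue for 10k more steps without resetting the step counter
model.learn(total_timesteps=10000, reset_num_timesteps=False)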
See scripts/train_rl_agent.py for a complete training template with best practices.
Requirements: Custom environments must inherit from gymnasium.Env and implement:
- __init__(): Define action_space and observation_space
- reset(seed, options): Return initial observation and info dict
- step(action): Return observation, reward, terminated, truncated, info
- render(): Visualization (optional)
- close(): Cleanup resources

Key Constraints:
- Image observations must be np.uint8 in range [0, 255]
- Set normalize_images=False in policy_kwargs if images are pre-normalized
- Discrete or MultiDiscrete spaces with start!=0 are not supported

Validation:
from stable_baselines3.common.env_checker import check_env
check_env(env, warn=True)
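For illustration, a minimal environment that satisfies these requirements and passes check_env (a toy sketch; MinimalEnv is hypothetical, not one of the bundled templates):

import numpy as np
import gymnasium as gym
from gymnasium import spaces

class MinimalEnv(gym.Env):
    # Toy 1-D task: start at position 0, reach position 10 by moving left/right
    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(2)  # 0 = left, 1 = right
        self.observation_space = spaces.Box(low=0, high=10, shape=(1,), dtype=np.float32)
        self.pos = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.pos = 0
        return np.array([self.pos], dtype=np.float32), {}

    def step(self, action):
        self.pos = max(0, min(10, self.pos + (1 if action == 1 else -1)))
        terminated = self.pos == 10
        reward = 1.0 if terminated else -0.01
        return np.array([self.pos], dtype=np.float32), reward, terminated, False, {}

check_env(MinimalEnv(), warn=True)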
See scripts/custom_env_template.py for a complete custom environment template and references/custom_environments.md for comprehensive guidance.
Purpose: Vectorized environments run multiple environment instances in parallel, accelerating training and enabling certain wrappers (frame-stacking, normalization).
Types:
- DummyVecEnv: runs all environments sequentially in the current process (low overhead; good default)
- SubprocVecEnv: runs each environment in its own process (true parallelism for expensive environments)
Quick Setup:
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv
# Create 4 parallel environments
env = make_vec_env("CartPole-v1", n_envs=4, vec_env_cls=SubprocVecEnv)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=25000)
Off-Policy Optimization: When using multiple environments with off-policy algorithms (SAC, TD3, DQN), set gradient_steps=-1 to perform one gradient update per environment step, balancing wall-clock time and sample efficiency.
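A minimal sketch of this setting (the environment choice is illustrative):

from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env("Pendulum-v1", n_envs=4)
# gradient_steps=-1: perform as many gradient updates per training call
# as environment transitions were collected
model = SAC("MlpPolicy", env, train_freq=1, gradient_steps=-1)
model.learn(total_timesteps=20000)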
API Differences:
- reset() returns only observations (infos are available via vec_env.reset_infos)
- step() returns a 4-tuple (obs, rewards, dones, infos), not a 5-tuple
- Environments auto-reset on episode end; the true final observation is available via infos[env_idx]["terminal_observation"] (sketched below)

See references/vectorized_envs.md for detailed information on wrappers and advanced usage.
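A minimal interaction loop making these differences concrete (a sketch, reusing env from the Quick Setup above):

obs = env.reset()  # observations only, no info dict
for _ in range(1000):
    actions = [env.action_space.sample() for _ in range(env.num_envs)]
    obs, rewards, dones, infos = env.step(actions)  # 4-tuple
    for i, done in enumerate(dones):
        if done:
            # env i has already auto-reset; recover its true final observation
            final_obs = infos[i]["terminal_observation"]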
Purpose: Callbacks enable monitoring metrics, saving checkpoints, implementing early stopping, and custom training logic without modifying core algorithms.
Common Callbacks:
- EvalCallback: periodically evaluates the agent and can save the best model
- CheckpointCallback: saves the model every N steps
- StopTrainingOnRewardThreshold: stops training once a reward threshold is reached (used with EvalCallback)
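Constructing the two most common ones (a sketch; eval_env and the paths are illustrative):

from stable_baselines3.common.callbacks import EvalCallback, CheckpointCallback

eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./best_model/",
    eval_freq=5000,
    n_eval_episodes=5,
)
checkpoint_callback = CheckpointCallback(
    save_freq=10000,
    save_path="./checkpoints/",
    name_prefix="rl_model",
)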
Custom Callback Structure:
from stable_baselines3.common.callbacks import BaseCallback

class CustomCallback(BaseCallback):
    def _on_training_start(self):
        # Called before first rollout
        pass

    def _on_step(self):
        # Called after each environment step
        # Return False to stop training
        return True

    def _on_rollout_end(self):
        # Called at end of rollout
        pass
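Only _on_step() is required (it is the abstract method on BaseCallback); the other hooks are optional overrides.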
Available Attributes:
- self.model: The RL algorithm instance
- self.num_timesteps: Total environment steps
- self.training_env: The training environment

Chaining Callbacks:
from stable_baselines3.common.callbacks import CallbackList
callback = CallbackList([eval_callback, checkpoint_callback, custom_callback])
model.learn(total_timesteps=10000, callback=callback)
See references/callbacks.md for comprehensive callback documentation.
Saving and Loading:
from stable_baselines3.common.vec_env import VecNormalize

# Save model
model.save("model_name")
# Save normalization statistics (if using VecNormalize)
vec_env.save("vec_normalize.pkl")
# Load model
model = PPO.load("model_name", env=env)
# Load normalization statistics
vec_env = VecNormalize.load("vec_normalize.pkl", vec_env)
Parameter Access:
# Get parameters
params = model.get_parameters()
# Set parameters
model.set_parameters(params)
# Access PyTorch state dict
state_dict = model.policy.state_dict()
Evaluation:
from stable_baselines3.common.evaluation import evaluate_policy
mean_reward, std_reward = evaluate_policy(
    model,
    env,
    n_eval_episodes=10,
    deterministic=True,
)
Video Recording:
from stable_baselines3.common.vec_env import VecVideoRecorder
# Wrap environment with video recorder
env = VecVideoRecorder(
    env,
    "videos/",
    record_video_trigger=lambda x: x % 2000 == 0,
    video_length=200,
)
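The wrapped env must be a VecEnv whose underlying environments render rgb_array frames; for example (a sketch):

import gymnasium as gym
from stable_baselines3.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: gym.make("CartPole-v1", render_mode="rgb_array")])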
See scripts/evaluate_agent.py for a complete evaluation and recording template.
Learning Rate Schedules:
def linear_schedule(initial_value):
    def func(progress_remaining):
        # progress_remaining goes from 1 to 0
        return progress_remaining * initial_value
    return func
model = PPO("MlpPolicy", env, learning_rate=linear_schedule(0.001))
Multi-Input Policies (Dict Observations):
model = PPO("MultiInputPolicy", env, verbose=1)
Use when observations are dictionaries (e.g., combining images with sensor data).
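For example, an environment with a dict observation space like the following (key names are illustrative) requires MultiInputPolicy:

import numpy as np
from gymnasium import spaces

observation_space = spaces.Dict({
    "image": spaces.Box(low=0, high=255, shape=(64, 64, 3), dtype=np.uint8),
    "vector": spaces.Box(low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32),
})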
Hindsight Experience Replay:
from stable_baselines3 import SAC, HerReplayBuffer
model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
    ),
)
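Note that HER requires a goal-conditioned environment: dict observations with observation, achieved_goal, and desired_goal keys, plus a compute_reward() method.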
TensorBoard Integration:
model = PPO("MlpPolicy", env, tensorboard_log="./tensorboard/")
model.learn(total_timesteps=10000)
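Logs can then be viewed with the standard TensorBoard CLI:

tensorboard --logdir ./tensorboard/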
Starting a New RL Project:
1. Choose an algorithm using references/algorithms.md for selection guidance
2. Build a custom environment from scripts/custom_env_template.py if needed
3. Validate the environment with check_env() before training
4. Use scripts/train_rl_agent.py as a starting template
5. Evaluate the trained agent with scripts/evaluate_agent.py for assessment

Common Issues:
- Out-of-memory errors: reduce buffer_size for off-policy algorithms or use fewer parallel environments
- Import errors: ensure stable_baselines3 is installed: uv pip install stable-baselines3[extra]

Bundled Scripts:
- train_rl_agent.py: Complete training script template with best practices
- evaluate_agent.py: Agent evaluation and video recording template
- custom_env_template.py: Custom Gym environment template

Reference Docs:
- algorithms.md: Detailed algorithm comparison and selection guide
- custom_environments.md: Comprehensive custom environment creation guide
- callbacks.md: Complete callback system reference
- vectorized_envs.md: Vectorized environment usage and wrappers

Installation:
# Basic installation
uv pip install stable-baselines3
# With extra dependencies (Tensorboard, etc.)
uv pip install stable-baselines3[extra]