Important prerequisite
Installing AI Skills requires unrestricted internet access to GitHub; if you are behind a restricted network, enable a proxy/VPN with TUN mode before installing, or the installation will fail.
simpo-training by orchestra-research/ai-research-skills
npx skills add https://github.com/orchestra-research/ai-research-skills --skill simpo-training
SimPO is a reference-free preference optimization method that outperforms DPO without needing a reference model.
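SimPO's implicit reward is the length-normalized average log-probability of a response, which is what removes the need for a reference model. A minimal sketch of the sigmoid variant of the per-pair loss, following the published SimPO objective (illustrative only; the authoritative implementation is scripts/run_simpo.py):

```python
import math

def simpo_sigmoid_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
                       beta=2.0, gamma_beta_ratio=0.5):
    """SimPO loss for one preference pair (sigmoid variant).

    logp_* are summed token log-probabilities of each response. Dividing
    by response length makes the implicit reward the average log-prob,
    so no reference model is needed.
    """
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    gamma = beta * gamma_beta_ratio          # target reward margin
    margin = reward_chosen - reward_rejected - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Chosen response is more probable per token, so the loss is small
loss = simpo_sigmoid_loss(-20.0, 10, -45.0, 15, beta=2.0, gamma_beta_ratio=0.5)
```

Swapping the chosen and rejected responses flips the sign of the margin and makes the loss much larger, which is the gradient signal that pushes probability mass toward preferred responses.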
Installation:
# Create environment
conda create -n simpo python=3.10 && conda activate simpo
# Install PyTorch 2.2.2
# See: https://pytorch.org/get-started/locally/
# Install alignment-handbook
git clone https://github.com/huggingface/alignment-handbook.git
cd alignment-handbook
python -m pip install .
# Install Flash Attention 2
python -m pip install flash-attn --no-build-isolation
Training (Mistral 7B):
ACCELERATE_LOG_LEVEL=info accelerate launch \
--config_file accelerate_configs/deepspeed_zero3.yaml \
scripts/run_simpo.py \
training_configs/mistral-7b-base-simpo.yaml
Config (mistral-7b-base-simpo.yaml):
# Model
model_name_or_path: mistralai/Mistral-7B-v0.1
torch_dtype: bfloat16
# Dataset
dataset_mixer:
  HuggingFaceH4/ultrafeedback_binarized: 1.0
dataset_splits:
  - train_prefs
  - test_prefs
# SimPO hyperparameters
beta: 2.0                # reward scaling (typical range 2.0-10.0)
gamma_beta_ratio: 0.5    # target margin (0-1)
loss_type: sigmoid       # sigmoid or hinge
sft_weight: 0.0          # optional SFT regularization
# Training
learning_rate: 5e-7      # critical: keep within 3e-7 to 1e-6
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
# Output
output_dir: ./outputs/mistral-7b-simpo
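gamma_beta_ratio sets the target reward margin relative to beta (gamma = beta * gamma_beta_ratio), and loss_type chooses how a given margin is penalized. A sketch, under that assumption, of how the sigmoid and hinge variants differ:

```python
import math

def margin_penalty(reward_margin, gamma, loss_type="sigmoid"):
    """Penalty applied to a reward margin (chosen minus rejected).

    sigmoid: smooth log-loss, nonzero everywhere, saturating gradients.
    hinge:   exactly zero once the margin exceeds gamma, linear below it.
    """
    if loss_type == "sigmoid":
        return math.log(1.0 + math.exp(-(reward_margin - gamma)))
    return max(0.0, gamma - reward_margin)  # hinge

gamma = 2.0 * 0.5  # beta * gamma_beta_ratio from the config above
for m in (0.0, 1.0, 3.0):
    print(m, margin_penalty(m, gamma, "sigmoid"), margin_penalty(m, gamma, "hinge"))
```

With hinge, pairs already separated by more than gamma contribute nothing to the gradient; sigmoid keeps a small pull on every pair.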
Launch training:
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \
scripts/run_simpo.py training_configs/mistral-7b-base-simpo.yaml
Config (llama3-8b-instruct-simpo.yaml):
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
dataset_mixer:
  argilla/ultrafeedback-binarized-preferences-cleaned: 1.0
beta: 2.5
gamma_beta_ratio: 0.5
learning_rate: 5e-7
sft_weight: 0.1          # add SFT loss to preserve capabilities
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
output_dir: ./outputs/llama3-8b-simpo
Launch:
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \
scripts/run_simpo.py training_configs/llama3-8b-instruct-simpo.yaml
For math/code tasks:
model_name_or_path: deepseek-ai/deepseek-math-7b-base
dataset_mixer:
  argilla/distilabel-math-preference-dpo: 1.0
beta: 5.0                # higher for a stronger signal
gamma_beta_ratio: 0.7    # larger margin
learning_rate: 3e-7      # lower LR for reasoning tasks
sft_weight: 0.0
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
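The three configs trade per-device batch size against gradient accumulation; the effective batch size is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs. A quick check (num_gpus = 8 is an assumption here; the actual count depends on your accelerate/DeepSpeed setup):

```python
# (per_device_train_batch_size, gradient_accumulation_steps) per config
configs = {
    "mistral-7b-base": (1, 8),
    "llama3-8b-instruct": (2, 4),
    "deepseek-math-7b": (1, 16),
}
num_gpus = 8  # assumption; set to your actual GPU count

for name, (per_device, accum) in configs.items():
    effective = per_device * accum * num_gpus
    print(f"{name}: effective batch = {effective}")
```

The math config halves per-device memory pressure while doubling the effective batch, which suits the lower learning rate used for reasoning tasks.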
Use SimPO when:
Algorithm selection:
Use alternatives when:
Issue: loss divergence
Reduce the learning rate:
learning_rate: 3e-7 # down from 5e-7
Reduce beta:
beta: 1.0 # down from 2.0
Issue: model forgets capabilities
Add SFT regularization:
sft_weight: 0.1 # add an SFT loss component
Issue: poor preference separation
Increase beta and the margin:
beta: 5.0 # up from 2.0
gamma_beta_ratio: 0.8 # up from 0.5
Issue: out of memory during training
Reduce the batch size:
per_device_train_batch_size: 1
gradient_accumulation_steps: 16 # keeps the effective batch size
Enable gradient checkpointing:
gradient_checkpointing: true
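The sft_weight fix above mixes a standard SFT term (negative log-likelihood on the chosen response) into the preference loss. A sketch of the combined objective, assuming the two loss terms have already been computed for a pair:

```python
def combined_loss(simpo_loss, nll_chosen, sft_weight=0.1):
    """Total training loss: preference loss plus optional SFT regularizer.

    sft_weight = 0.0 recovers plain SimPO; a small positive value keeps
    the model anchored to the chosen responses and limits forgetting.
    """
    return simpo_loss + sft_weight * nll_chosen

print(combined_loss(0.31, 1.8, sft_weight=0.0))  # plain SimPO
print(combined_loss(0.31, 1.8, sft_weight=0.1))  # with SFT anchor
```

Keep sft_weight small (0.05-0.1 in the configs above); a large value drowns out the preference signal and the run degenerates toward ordinary SFT.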
Loss functions: see references/loss-functions.md for sigmoid vs hinge loss, the mathematical formulations, and when to use each.
Hyperparameter tuning: see references/hyperparameters.md for beta, gamma, and learning-rate selection, plus model-size-specific recommendations.
Dataset preparation: see references/datasets.md for preference data formats, quality filtering, and custom dataset creation.
Memory optimization:
Weekly Installs
65
Repository
GitHub Stars
5.5K
First Seen
Feb 7, 2026
Security Audits
Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on
opencode: 55
cursor: 55
codex: 54
gemini-cli: 53
github-copilot: 52
claude-code: 52