huggingface-accelerate by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill huggingface-accelerate
Accelerate simplifies distributed training to 4 lines of code.
Installation:
pip install accelerate
Convert PyTorch script (4 lines):
import torch
+ from accelerate import Accelerator
+ accelerator = Accelerator()
model = torch.nn.Transformer()
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset)
+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
-     loss.backward()
+     accelerator.backward(loss)
    optimizer.step()
Run (single command):
accelerate launch train.py
Original script:
# train.py
import torch
model = torch.nn.Linear(10, 2).to('cuda')
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
for epoch in range(10):
    for batch in dataloader:
        batch = batch.to('cuda')
        optimizer.zero_grad()
        loss = model(batch).mean()
        loss.backward()
        optimizer.step()
With Accelerate (4 lines added):
# train.py
import torch
from accelerate import Accelerator # +1
accelerator = Accelerator() # +2
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader) # +3
for epoch in range(10):
    for batch in dataloader:
        # No .to('cuda') needed - automatic!
        optimizer.zero_grad()
        loss = model(batch).mean()
        accelerator.backward(loss)  # +4
        optimizer.step()
Configure (interactive):
accelerate config
Questions cover the compute environment, number of machines and processes, mixed precision, and optional DeepSpeed/FSDP settings.
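The interactive answers are written to a YAML file (by default under ~/.cache/huggingface/accelerate/default_config.yaml), which you can also write by hand. A minimal single-node multi-GPU sketch, assuming current Accelerate field names:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0
num_machines: 1
num_processes: 8
mixed_precision: bf16
```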
Launch (works on any setup):
# Single GPU
accelerate launch train.py
# Multi-GPU (8 GPUs)
accelerate launch --multi_gpu --num_processes 8 train.py
# Multi-node
accelerate launch --multi_gpu --num_processes 16 \
--num_machines 2 --machine_rank 0 \
--main_process_ip $MASTER_ADDR \
train.py
Enable FP16/BF16:
from accelerate import Accelerator
# FP16 (with gradient scaling)
accelerator = Accelerator(mixed_precision='fp16')
# BF16 (no scaling, more stable)
accelerator = Accelerator(mixed_precision='bf16')
# FP8 (H100+)
accelerator = Accelerator(mixed_precision='fp8')
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
# Everything else is automatic!
for batch in dataloader:
    with accelerator.autocast():  # Optional, done automatically
        loss = model(batch)
    accelerator.backward(loss)
Enable DeepSpeed ZeRO-2:
from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,                     # ZeRO-2
    gradient_accumulation_steps=4,
    offload_optimizer_device="none",  # no optimizer offload
)
accelerator = Accelerator(
    mixed_precision='bf16',
    deepspeed_plugin=deepspeed_plugin,
)
# Same code as before!
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
Or via config:
accelerate config
# Select: DeepSpeed → ZeRO-2
deepspeed_config.json:
{
  "fp16": {"enabled": false},
  "bf16": {"enabled": true},
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {"device": "cpu"},
    "allgather_bucket_size": 5e8,
    "reduce_bucket_size": 5e8
  }
}
Launch (note: --config_file expects an Accelerate config file, which points to the DeepSpeed JSON through its deepspeed_config_file field):
accelerate launch --config_file accelerate_config.yaml train.py
Enable FSDP:
from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy="FULL_SHARD",              # ZeRO-3 equivalent
    auto_wrap_policy="transformer_based_wrap",
    cpu_offload=False,
)
accelerator = Accelerator(
    mixed_precision='bf16',
    fsdp_plugin=fsdp_plugin,
)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
Or via config:
accelerate config
# Select: FSDP → Full Shard → No CPU Offload
Accumulate gradients:
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(model):  # Handles accumulation
        optimizer.zero_grad()
        loss = model(batch)
        accelerator.backward(loss)
        optimizer.step()
Effective batch size: batch_size * num_gpus * gradient_accumulation_steps
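A worked example of that formula, using hypothetical numbers (32 per device, 8 GPUs, 4 accumulation steps):

```python
# All three factors multiply into the global (effective) batch size
# that the optimizer actually sees per update.
per_device_batch_size = 32
num_gpus = 8
gradient_accumulation_steps = 4

effective_batch_size = per_device_batch_size * num_gpus * gradient_accumulation_steps
print(effective_batch_size)  # 1024
```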
Use Accelerate when: you already have a PyTorch training loop and want it to run unchanged on CPU, single or multiple GPUs, TPU, or multiple nodes.
Key advantages: minimal code changes, one launch command for every setup, and built-in mixed precision, DeepSpeed, and FSDP integration.
Use alternatives instead: when you want a batteries-included training framework (e.g. Hugging Face Trainer or PyTorch Lightning) that handles logging, callbacks, and checkpointing for you.
Issue: Wrong device placement
Don't manually move to device:
# WRONG
batch = batch.to('cuda')
# CORRECT
# Accelerate handles it automatically after prepare()
Issue: Gradient accumulation not working
Use context manager:
# CORRECT
with accelerator.accumulate(model):
    optimizer.zero_grad()
    accelerator.backward(loss)
    optimizer.step()
Issue: Checkpointing in distributed training
Use accelerator methods, and call them on every process (Accelerate coordinates what each rank writes; sharded setups like DeepSpeed and FSDP need all ranks to participate):
accelerator.save_state('checkpoint/')
accelerator.load_state('checkpoint/')
# Guard only your own extra artifacts (logs, configs) with accelerator.is_main_process
Issue: Different results with FSDP
Ensure same random seed:
from accelerate.utils import set_seed
set_seed(42)
Megatron integration: See references/megatron-integration.md for tensor parallelism, pipeline parallelism, and sequence parallelism setup.
Custom plugins: See references/custom-plugins.md for creating custom distributed plugins and advanced configuration.
Performance tuning: See references/performance.md for profiling, memory optimization, and best practices.
Launcher requirements:
- torch.distributed.run (built-in)
- deepspeed (pip install deepspeed)